Machine Learning-Driven Predictive Modeling for Opioid Use Disorder in Chronic Pain Patients on Long-Term Opioid Use

Overview

This predictive model aims to identify opioid use disorder among chronic pain patients on long-term opioid therapy by incorporating diverse risk factors, grounded in the biopsychosocial model of pain. It provides a robust predictive framework and serves as a foundation for more advanced machine learning models that will incorporate a broader range of psychiatric conditions, ultimately enhancing early identification and intervention strategies for at-risk patients. For this project, I am collaborating with two psychology experts specializing in substance use disorders, including opioid use disorder.

  • Link to final project GitHub repository: https://github.com/yoonjae-hub/BMIN503_Final_Project.git

1. Introduction

Nearly 51.6 million Americans live with chronic pain (Rikard, Strahan et al. 2023). Managing chronic non-cancer pain (CNCP) is particularly challenging because of its complex neurobiological and psychosocial mechanisms (Fillingim, Bruehl et al. 2014), which profoundly impact patients’ physical, psychological, and social well-being (Genova, Dix et al. 2020). CNCP is defined as persistent or recurrent pain lasting three months or longer and encompasses a range of non-malignant conditions, including neuropathic pain, rheumatoid arthritis, lower back pain, osteoarthritis, fibromyalgia, and various other conditions. A major concern with CNCP is the ongoing use of opioids. Opioids are widely prescribed for chronic pain but can cause adverse outcomes such as unintentional overdose, misuse, addiction, and even death (Dahlhamer, Connor et al. 2021, Dowell, Ragan et al. 2022). As many as 25% of patients treated with opioids for chronic pain are known to develop opioid use disorder (OUD) (Cordier Scott and Lewis 2016).

Leveraging AI in healthcare can transform OUD management by analyzing large datasets to develop machine learning (ML) algorithms that identify individuals at risk. This approach enables earlier intervention and more personalized care. Recently, the FDA approved AvertD, an ML-based risk score model designed to assess genetic predisposition to OUD using a prescription-only genotyping test (U.S. Food and Drug Administration 2024). However, AvertD relies solely on buccal swab specimens, excluding significant socioeconomic, physical, and psychological factors associated with OUD. Additionally, AvertD is designed for first-time opioid users, while OUD incidence is higher among individuals on long-term opioid therapy. This highlights the need for an ML model that targets chronic opioid users and integrates broader risk factors for better accessibility and applicability. Therefore, this study aims to develop an ML model to predict OUD among chronic pain patients who have been on opioid therapy, integrating biopsychosocial factors that are known to influence OUD risk.

This problem is inherently interdisciplinary because pain, particularly in the context of chronic pain and opioid use disorder (OUD), cannot be fully understood or addressed from a single disciplinary perspective. The biopsychosocial model emphasizes that pain is influenced not only by biological factors but also by psychological and social dimensions. Therefore, incorporating experts from psychology is essential to explore the psychological underpinnings of pain, such as emotional distress, coping mechanisms, and the role of comorbid psychiatric conditions like anxiety, depression, or substance use disorders.

2. Methods

Data source This project used a dataset collected for a larger parent study that evaluated the phenotypic and genotypic profiles of CNCP patients on opioid therapy who did and did not develop an OUD. Data were collected between November 2012 and September 2018. This study includes 1,356 patients who were receiving long-term (6 months or more) opioid therapy (LTOT) to manage CNCP. Subjects were recruited from three regions of the United States (Pennsylvania, Washington, and Utah). The eligibility criteria for the parent study were: individuals who were (1) aged 18 or older, (2) Caucasian and of European descent (defined as having 3 or 4 grandparents of European descent), (3) experiencing CNCP of musculoskeletal or neuropathic origin persisting for at least 6 months, (4) having no history of substance abuse (except nicotine) before starting LTOT, and (5) having received LTOT to treat their pain for at least the past 6 months. Due to the genotypic and phenotypic objectives of the parent study, persons who were non-Caucasian or had a previous history of substance use disorder were excluded. Additional exclusion criteria included severe psychiatric conditions that hindered the ability to provide informed consent or complete the questionnaire, and pain conditions not of musculoskeletal/neuropathic origin (e.g., cancer, gynecologic issues, abdominal discomfort, visceral concerns, dental problems, neuropathic pain due to metabolic disease).

Determination of OUD For this analysis, participants were grouped as either ‘cases’ or ‘controls’ based on whether or not they developed OUD after initiating LTOT. The control group included patients who had no evidence of opioid abuse and did not meet the criteria for OUD at the time of assessment. Their electronic health records (EHR) were reviewed monthly for 12 consecutive months after enrollment to ensure they did not develop an OUD after completing the baseline assessment. To be considered controls, patients receiving LTOT needed to have negative urine drug screens (which detect opioid metabolites and other illicit drugs), have no record of current or past SUD, and show no aberrant drug-related behaviors as assessed with an expert-developed checklist (Cheatle, O’Brien et al. 2013). Cases were patients who had no history of SUD (except nicotine) prior to beginning LTOT and met DSM-IV criteria for ‘opioid dependence’ at the time of enrollment.

Feature Selection Given the biopsychosocial aspects of OUD, the initial analysis will explore variables such as demographic factors (age, sex, race, ethnicity, marital status, education, employment, financial situation), pain (severity, interference), nicotine dependence, psychiatric conditions (depression, anxiety, pain catastrophizing, mental defeat, suicidality), and social support. To check multicollinearity among these variables, variance inflation factor (VIF) was used.
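To make the multicollinearity check concrete, here is a minimal sketch using `car::vif()`. It assumes the merged analysis dataset `final_data` (with outcome `cohort`) assembled later in the Methods; the predictor subset shown is illustrative, not the final feature set.

```r
# Sketch of a VIF check: fit a logistic model on a few candidate
# continuous predictors, then inspect variance inflation factors.
library(car)

vif_fit <- glm(cohort ~ DEM003 + pain_severity_sum + pain_interf_sum +
                 ca_sum + pps_sum,
               data = final_data, family = binomial)
vif(vif_fit)  # values above ~5 are commonly flagged as problematic
```

For models that include factors, `vif()` reports generalized VIFs (GVIF), which should be interpreted on the GVIF^(1/(2*df)) scale.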

Implementation of ML This study will use a supervised learning approach to develop the predictive model. Given the binary outcome, algorithms such as logistic regression, random forest, and gradient boosting will be employed. To compare performance, each algorithm will be evaluated using the area under the receiver operating characteristic curve (AUC-ROC). A potential challenge in this approach is missing data, as some variables rely on total scores: a single missing item makes the whole score missing, leading to more extensive data gaps than if items were considered separately. To address this, missing-data imputation techniques will be considered. Also, to ensure model generalization and detect potential overfitting/underfitting, cross-validation (k-fold) will be applied.
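A minimal sketch of this plan with tidymodels, assuming the merged analysis dataset `final_data` (outcome `cohort`, identifier `SubjID`) assembled in Section 2.3: 5-fold cross-validation of a logistic regression scored by AUC-ROC, with simple imputation steps as placeholders for the techniques to be chosen later.

```r
# Sketch only: logistic regression with 5-fold CV and ROC-AUC scoring.
# Imputation steps are placeholders, not the final missing-data strategy.
library(tidymodels)

set.seed(503)
folds <- vfold_cv(final_data, v = 5, strata = cohort)

oud_recipe <- recipe(cohort ~ ., data = final_data) %>%
  update_role(SubjID, new_role = "id") %>%           # keep the ID out of the model
  step_impute_median(all_numeric_predictors()) %>%   # placeholder imputation
  step_impute_mode(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors())

logit_wf <- workflow() %>%
  add_recipe(oud_recipe) %>%
  add_model(logistic_reg() %>% set_engine("glm"))

cv_results <- fit_resamples(logit_wf, resamples = folds,
                            metrics = metric_set(roc_auc))
collect_metrics(cv_results)  # mean AUC-ROC across the 5 folds
```

Random forest and gradient boosting would slot into the same workflow by swapping the `logistic_reg()` spec for `rand_forest()` or `boost_tree()`.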

2.1 Loading Packages

library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2) # For graphs
library(lubridate) # For manipulating dates
library(gtsummary) #Summary statistics
library(RColorBrewer) # For coloring the plots
library(cowplot) # For combining multiple plots into one

Attaching package: 'cowplot'

The following object is masked from 'package:lubridate':

    stamp
library(ggpubr) # Combining multiple graphs into one figure 

Attaching package: 'ggpubr'

The following object is masked from 'package:cowplot':

    get_legend
library(pROC) # For ROC curves and AUC
Type 'citation("pROC")' for a citation.

Attaching package: 'pROC'

The following objects are masked from 'package:stats':

    cov, smooth, var
library(dplyr) # For data cleaning
library(gtsummary)
library(haven)
library(rsample)
library(car)
Loading required package: carData

Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
✔ broom        1.0.7     ✔ recipes      1.1.0
✔ dials        1.3.0     ✔ tune         1.2.1
✔ infer        1.0.7     ✔ workflows    1.1.4
✔ modeldata    1.4.0     ✔ workflowsets 1.1.0
✔ parsnip      1.2.1     ✔ yardstick    1.3.1
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ car::recode()     masks dplyr::recode()
✖ car::some()       masks purrr::some()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Search for functions across packages at https://www.tidymodels.org/find/
library(dotwhisker)
library(parsnip)

2.2 Variables

#Predictors
#Demographic - gender (DEM002), age (DEM003), marital status (DEM004), living alone (DEM005), education (DEM006), employment (DEM007), financial status (DEM008)
#Please note that due to the genotypic and phenotypic objectives of the parent study, persons who were non-Caucasian or had a previous history of substance use disorder were excluded from the study.

demp_v2 <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/demp_v2.sav")

#Pain - severity, interference - Brief Pain Inventory
bpip <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/bpip.sav")

#Smoking- Fagerstrom scale
fnts<- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/fnts.sav")

#Alcohol
aadidm <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/aadidm.sav")

#Anxiety/Depression-PHQ-4
phq4 <-read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/phq4.sav")

#Pain catastrophizing (Subscale) - Coping strategies questionnaire
copsq<- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/copsq.sav")

#Mental defeat-Pain perception scale
pps <-read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/pps.sav")

#Suicidality - SBQ-R
sbqr <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/sbqr.sav")

#Social support - Duke social support index
dssi_v2 <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/dssi_v2.sav")

#Outcome - Opioid Use Disorder
pevc <- read_sav("/Users/yoonjaelee/Desktop/a_Upenn_PhD_2023-2026/2nd year_Fall 2024/EHR class/Final Project/R01 Data/pevc.sav")

2.3 Cleaning and Transforming Data for Analysis

#Demographic - gender, age, marital status, living status, education, employment, financial status
demo <-demp_v2 %>%
  select(DEM002, DEM003, DEM004, DEM005, DEM006, DEM007, DEM008, SubjID)
demo <- demo %>%
  mutate(across(
    where(~inherits(., "haven_labelled")), 
    as_factor
  ))
demo <- demo %>%
  mutate(across(
    where(~inherits(., "haven_labelled") & is.numeric(.)),
    as.numeric
  ))

demo$DEM003[which(demo$DEM003 == "(X)Subject refused to answer")] <- NA
demo$DEM005[which(demo$DEM005 == "Do not know/refused")] <- NA
demo$DEM005 <- droplevels(demo$DEM005)
levels(demo$DEM005)
[1] "Live alone"                 "One other person"          
[3] "More than one other person"
demo$DEM006[which(demo$DEM006 == "(X)Subject refused to answer")] <- NA
demo$DEM006 <- droplevels(demo$DEM006)
levels(demo$DEM006)
[1] "Grade 6 or less"                                
[2] "Grade 7 to 12"                                  
[3] "Graduated high school or high school equivalent"
[4] "Part college"                                   
[5] "Graduated 2 years college"                      
[6] "Graduated 4 years college"                      
[7] "Part graduate/professional school"              
[8] "Completed graduate/professional school"         
demo$DEM007 <- as.character(demo$DEM007)
demo$DEM007[which(demo$DEM007 == "Do not know/refused")] <- NA
demo$DEM007 <- as_factor(demo$DEM007)
demo$DEM007 <- droplevels(demo$DEM007)
levels(demo$DEM007)
[1] "Yes" "No" 
demo$DEM008 <- as.character(demo$DEM008)
demo$DEM008[which(demo$DEM008 == "Do not know/Refused")] <- NA
demo$DEM008 <- as_factor(demo$DEM008)
demo$DEM008 <- droplevels(demo$DEM008)
levels(demo$DEM008)
[1] "Ca not make ends meet"         "Have just enough to get along"
[3] "Are comfortable"              
#transforming to factor (all, except DEM003 - age)
demo$DEM002 <- as_factor(demo$DEM002)
demo$DEM003 <- as.numeric(demo$DEM003)
demo$DEM004 <- as_factor(demo$DEM004)
demo$DEM005 <- as_factor(demo$DEM005)
demo$DEM006 <- as_factor(demo$DEM006)
demo$DEM007 <- as_factor(demo$DEM007)
demo$DEM008 <- as_factor(demo$DEM008)


#checking (examples)
class(demo$DEM004)       # Should show "factor"
[1] "factor"
levels(demo$DEM005)      # no 'Do not know/refused' level remains
[1] "Live alone"                 "One other person"          
[3] "More than one other person"
is.numeric(demo$DEM005)  # Should return FALSE
[1] FALSE
#pain
#pain-severity
pain_severity <- bpip %>%
  filter(month == 0) %>%
  mutate(pain_severity_sum = if_else(if_any(c(BPI001, BPI002, BPI003, BPI004), ~ . == -9),
                                     NA_real_,
                                     BPI001 + BPI002 + BPI003 + BPI004)) %>%
  select(pain_severity_sum, SubjID)

#pain-interference
pain_interf <- bpip %>%
  filter(month == 0) %>%
  mutate(pain_interf_sum = if_else(if_any(c(BPI007A, BPI007B, BPI007C, BPI007D, BPI007E, BPI007G), ~ . == -9),
                                   NA_real_, # changing -9 to NA value
                                   BPI007A + BPI007B + BPI007C + BPI007D + BPI007E + BPI007G)) %>%
  select(pain_interf_sum, SubjID)
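The two sums above repeat the same pattern (treat the -9 refusal code as missing, otherwise sum the items), and it recurs again for the catastrophizing subscale below. A small helper could express it once; `sum_or_na` is a hypothetical name, not part of the original analysis.

```r
# Hypothetical helper: return NA if any listed item equals the refusal
# code (-9), otherwise the row-wise sum of those items.
sum_or_na <- function(data, items, code = -9) {
  vals <- data[items]
  ifelse(rowSums(vals == code, na.rm = TRUE) > 0, NA_real_, rowSums(vals))
}

# e.g., equivalent to the pain-severity mutate() above:
# bpip$pain_severity_sum <- sum_or_na(bpip, c("BPI001", "BPI002", "BPI003", "BPI004"))
```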

#Smoking: conditional survey (if the FNTS000 entry question is 0, the remaining items are system-missing; FNTS000 itself has no missing values)
missing_fnts <- sum(is.na(fnts$FNTS000))
missing_fnts #for the initial screening question, there is no missing data.
[1] 0
# 001, 004: Multiple-choice items (0-3) / # 2,3,5,6: Yes/No items (0-1)
fnts_cleaned <- fnts %>%
  mutate(
    fnts_sum = ifelse(FNTS000 == 0, NA,
                         FNTS001 + FNTS004 + 
                         FNTS002 + FNTS003 + FNTS005 + FNTS006)) %>%
  replace_na(list(fnts_sum=0))

fnts_cleaned <- fnts_cleaned %>%
  filter(month==0)

fnts_cleaned <- fnts_cleaned %>%
  mutate(dependence_level = case_when(
    fnts_sum == 0 ~ "Non-smoker",
    fnts_sum >= 1 & fnts_sum <= 2 ~ "Low",
    fnts_sum >= 3 & fnts_sum <= 4 ~ "Low to Moderate",
    fnts_sum >= 5 & fnts_sum <= 7 ~ "Moderate",
    fnts_sum >= 8 ~ "High")) %>%
  select(SubjID, fnts_sum, dependence_level)

fnts_cleaned_2 <- fnts_cleaned %>%
  select(SubjID, dependence_level) #sort out only two variables

#Alcohol use - more conservative, in the last 3 months
aadidm_cleaned <- aadidm %>%
  filter(month == 0) %>%
  mutate(
    alcohol = ifelse(AAD001 == 1, 1, 0),
    alcohol = replace(alcohol, is.na(alcohol), 0)  # Replace NA with 0
  ) %>%
  select(SubjID, alcohol)
table(aadidm_cleaned$alcohol)

  0   1 
807 517 
#PHQ-4 (Anxiety/Depression)
phq4 <- phq4 %>%
  filter (month == 0)
#PHQ-4: dichotomizing Anxiety, cutoff >=3
#PHQ-4: dichotomizing Depression, cutoff >=3
phq4 <- phq4 %>%
  mutate(Anxiety = ifelse(PHQ001 + PHQ002 >= 3, 1, 0),
         Depression = ifelse(PHQ003 + PHQ004 >= 3, 1, 0))
phq4_cleaned <- phq4 %>%
  select(Anxiety, Depression, SubjID)

phq4_cleaned$Anxiety <- as.factor(phq4_cleaned$Anxiety)
phq4_cleaned$Depression <- as.factor(phq4_cleaned$Depression)

#CSQ - Pain catastrophizing subscale only (score already calculated in the original dataset: CATAT)
ca_catastro <- copsq %>%
  filter(month == 0) %>%
  mutate(ca_sum = if_else(if_any(c(COP005, COP012, COP014, COP028, COP038, COP042), ~ . == -9),
                                   NA_real_,
                                   COP005 + COP012 + COP014 + COP028 + COP038 + COP042)) %>%
  select(ca_sum,SubjID)

##PSPS - has some missing values (pps_sum)
pps_cleaned <- pps %>%
  mutate(pps_sum = rowSums(across(starts_with("PPS"), ~ ifelse(. == -9, NA, .)), na.rm = FALSE)) # -9 (refused) -> NA; with na.rm = FALSE, any NA item makes pps_sum NA
pps_cleaned <- pps_cleaned%>%
  select(pps_sum, SubjID)

missing_pps <- sum(is.na(pps_cleaned$pps_sum))
missing_pps #n=85
[1] 85
#Suicidality: cutoff >=8 (higher risk) for clinical samples. No system-missing values, but a '-9' (refused to answer) in any of the specified items makes the total score NA (NA total scores: n = 6).
#scoring website: https://www.ncbi.nlm.nih.gov/books/NBK571017/box/ch3.b11/?report=objectonly

sbqr_cleaned <- sbqr %>%
  filter(month==0) %>%
  mutate(
    SBQ001 = case_when(
      SBQ001 %in% c(3, 4) ~ 3, #regrouping
      SBQ001 %in% c(5, 6) ~ 4,
      TRUE ~ SBQ001),
    SBQ003 = case_when(
      SBQ003 %in% c(2, 3) ~ 2,
      SBQ003 %in% c(4, 5) ~ 3,
      TRUE ~ SBQ003))

sbqr_cleaned <- sbqr_cleaned %>%
     mutate(sbqr_sum = if_else(
      rowSums(select(., SBQ001, SBQ002, SBQ003, SBQ004) == -9) > 0,
      NA_real_,
      rowSums(select(., SBQ001, SBQ002, SBQ003, SBQ004), na.rm = TRUE))) %>%
  select(SubjID, sbqr_sum)

#Categorize suicidality risk based on sbqr_sum (>=8: high risk, <8: low risk, NA => NA); missing data: n = 6
sbqr_cleaned <- sbqr_cleaned %>%
  mutate(
    suicidality_risk = case_when(
      sbqr_sum >= 8 ~ "High Risk",
      sbqr_sum < 8 ~ "Low Risk",
      is.na(sbqr_sum) ~ NA_character_))

suicidality <- sbqr_cleaned %>%
  select(SubjID, suicidality_risk)

#social support (DSSI) : 3 subscales scores
dssi_cleaned <- dssi_v2 %>%
  filter(month==0) %>%
  select(SubjID, SIDSSI, SSDSSI, ISDSSI)
#social interaction (SIS, 4 item) : assessing the amount of time: 3 point likert scale (missing data: NONE)
## dssi_cleaned$SIDSSI

#subjective social support (SSS, 7 items): assessing depth or quality of relationship: 3point (missing data: 14)
## dssi_cleaned$SSDSSI

#Instrumental social support (ISS, 12 items): degree to which others can be counted upon with daily activities;0/1; (missing data: 4)
## dssi_cleaned$ISDSSI

#outcome: opioid use disorder (cohort - 1: case=OUD, 0: control=no OUD)
cohort <- read.csv("/Users/yoonjaelee/Desktop/Dr.CHEATLE ML PROJECT/pevc_compiled.csv")
cohort_cleaned <- cohort %>%
  select(SubjID,cohort)
cohort_cleaned$cohort <- as.factor(cohort_cleaned$cohort) #changing the class to categorical values.
class(cohort_cleaned$cohort) #factor
[1] "factor"
#removing 16 Nas
cohort_cleaned <- cohort_cleaned[!is.na(cohort_cleaned$cohort), ]
cohort_cleaned <- cohort_cleaned %>%
  filter(SubjID != 0) #remove SubjID==0
table(cohort_cleaned$cohort)  #cohort 0:1042; 1:486

   0    1 
1042  486 
##Merging data_using left_join
merging1 <- left_join(cohort_cleaned, demo, by = "SubjID")
## Note: when merging the demo file with cohort_cleaned, some subjects have missing demographic data (not registered for the study) -> removed (173 patients)
merging1 <- merging1 %>%
  drop_na()
merging2 <- left_join(merging1, pain_severity, by = "SubjID")
merging3 <- left_join(merging2, pain_interf, by = "SubjID")
merging4 <- left_join(merging3, fnts_cleaned_2, by = "SubjID")
merging5 <- left_join(merging4, phq4_cleaned, by = "SubjID")
merging6 <- left_join(merging5, ca_catastro, by = "SubjID")
merging7 <- left_join(merging6, pps_cleaned, by = "SubjID")
merging8 <- left_join(merging7, suicidality, by = "SubjID")
merging9 <- left_join(merging8, aadidm_cleaned, by = "SubjID")
final_data <- left_join(merging9, dssi_cleaned, by = "SubjID") #final sample N= 1331
table(final_data$cohort) #901/430

  0   1 
901 430 
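The nine sequential `left_join()` calls above can be collapsed into a single fold over the cleaned tables; this sketch is equivalent to the chain, assuming the objects created earlier in this section.

```r
# Fold all cleaned predictor tables into the cohort+demographics table
# (merging1) with one reduce() over left_join(), keyed on SubjID.
library(purrr)
library(dplyr)

tables <- list(pain_severity, pain_interf, fnts_cleaned_2, phq4_cleaned,
               ca_catastro, pps_cleaned, suicidality, aadidm_cleaned,
               dssi_cleaned)
final_data <- reduce(tables, left_join, by = "SubjID", .init = merging1)
```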

2.4 Exploratory Data Analysis (Distribution/Association)

Based on the distribution analysis, most variables were skewed. However, given the large sample size, we can rely on the Central Limit Theorem to approximate normality.

#Categorical variables
#1-1) Fnts - Visualization
fnts_eda <- left_join(cohort_cleaned, fnts_cleaned, by = "SubjID") %>%
  filter(!is.na(dependence_level), !is.na(cohort))

ggplot(fnts_eda, aes(x = dependence_level, fill = as.factor(cohort))) +
  geom_bar(position = "dodge") +
  labs(x = "Dependence Level", y = "Count", fill = "OUD Status") +
  scale_fill_discrete(labels = c("Control", "Case")) +
  theme_minimal()

oud_smoking <- table(final_data$cohort, final_data$dependence_level)
print(oud_smoking) # all expected counts >5; okay to perform chi-square test
   
    High Low Low to Moderate Moderate Non-smoker
  0   20  54              47       59        694
  1   43  49              75      123        109
chi_oud_smoking <- chisq.test(oud_smoking)
print(chi_oud_smoking) # p-value < 2.2e-16 -> Significantly associated.

    Pearson's Chi-squared test

data:  oud_smoking
X-squared = 332.86, df = 4, p-value < 2.2e-16
#2-1) phq4 - Distribution
phq4_eda <- left_join(cohort_cleaned, phq4_cleaned, by = "SubjID") %>%
  filter(!is.na(Anxiety), !is.na(Depression))

ggplot(phq4_eda, aes(x = Anxiety, fill = as.factor(cohort))) +
    geom_bar(position = "dodge") +
    labs(x = "Anxiety", y = "Count", fill = "OUD Status") +
   scale_fill_discrete(labels = c("Control", "Case")) +
   scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
    theme_minimal()

ggplot(phq4_eda, aes(x = Depression, fill = as.factor(cohort))) +
    geom_bar(position = "dodge") +
    labs(x = "Depression", y = "Count", fill = "OUD Status") +
   scale_fill_discrete(labels = c("Control", "Case")) +
   scale_x_discrete(labels = c("0" = "No", "1" = "Yes")) +
    theme_minimal()

#2-2) phq4 - Association
oud_anxiety <- table(final_data$cohort, final_data$Anxiety)
print(oud_anxiety) # expected counts >5. okay to perform chi-square test
   
      0   1
  0 473 396
  1 144 248
chi_oud_anxiety <- chisq.test(oud_anxiety, correct=F)
print(chi_oud_anxiety) # p-value = 5.947e-09 -> Significantly associated.

    Pearson's Chi-squared test

data:  oud_anxiety
X-squared = 33.852, df = 1, p-value = 5.947e-09
oud_dep <- table(final_data$cohort, final_data$Depression)
print(oud_dep) # expected counts >5. okay to perform chi-square test
   
      0   1
  0 517 352
  1 156 236
chi_oud_dep <- chisq.test(oud_dep, correct=F)
print(chi_oud_dep) # p-value = 8.595e-11 -> Significantly associated.

    Pearson's Chi-squared test

data:  oud_dep
X-squared = 42.117, df = 1, p-value = 8.595e-11
#3-1) SBQR-R - Distribution
sbqr_eda <- left_join(cohort_cleaned, sbqr_cleaned, by = "SubjID")%>%
  filter(!is.na(suicidality_risk))

ggplot(sbqr_eda, aes(x = suicidality_risk, fill = as.factor(cohort))) +
    geom_bar(position = "dodge") +
    labs(x = "Suicidality risk", y = "Count", fill = "OUD Status") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#3-2) SBQR-R - Association
oud_sui <- table(final_data$cohort, final_data$suicidality_risk)
print(oud_sui) # expected counts >5. okay to perform chi-square test
   
    High Risk Low Risk
  0        58      429
  1        77      161
chi_oud_sui <- chisq.test(oud_sui, correct=F)
print(chi_oud_sui) # p-value = 3.133e-11 -> Significantly associated.

    Pearson's Chi-squared test

data:  oud_sui
X-squared = 44.092, df = 1, p-value = 3.133e-11
#Continuous variables 
#1-1)Pain severity - Visualization
ps_data_noNA <- final_data %>%
  filter(!is.na(cohort), !is.na(pain_severity_sum))

ggplot(ps_data_noNA, aes(x = as.factor(cohort), y = pain_severity_sum, fill = as.factor(cohort))) +
    geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(x = "Cohort", y = "Pain Severity Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
   geom_jitter(height = 0, width = 0.1) +
    theme_minimal()

ggplot(ps_data_noNA, aes(x = pain_severity_sum, fill = as.factor(cohort))) +
    geom_histogram(aes(y = after_stat(density)), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "Pain Severity Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#1-2)Pain severity - Association
ps.nonoud <- ps_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,pain_severity_sum)

ps.oud <- ps_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, pain_severity_sum)

var.test(ps.nonoud$pain_severity_sum, ps.oud$pain_severity_sum) #p-value = 0.002074 -> unequal variance -> Welch's t-test

    F test to compare two variances

data:  ps.nonoud$pain_severity_sum and ps.oud$pain_severity_sum
F = 0.7723, num df = 875, denom df = 399, p-value = 0.002074
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6513741 0.9108211
sample estimates:
ratio of variances 
         0.7723014 
ps_ttest <- t.test(ps.nonoud$pain_severity_sum, ps.oud$pain_severity_sum)
print(ps_ttest) #statistically different.

    Welch Two Sample t-test

data:  ps.nonoud$pain_severity_sum and ps.oud$pain_severity_sum
t = 5.222, df = 690.86, p-value = 2.345e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.583093 3.490811
sample estimates:
mean of x mean of y 
 21.72945  19.19250 
#2-1)pain interference - Visualization
pi_data_noNA <- final_data %>%
  filter(!is.na(cohort), !is.na(pain_interf_sum))


ggplot(pi_data_noNA, aes(x = as.factor(cohort), y = pain_interf_sum, fill = as.factor(cohort))) +
    geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(x = "Cohort", y = "Pain Interference Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
   geom_jitter(height = 0, width = 0.1) +
    theme_minimal()

ggplot(pi_data_noNA, aes(x = pain_interf_sum, fill = as.factor(cohort))) +
    geom_histogram(aes(y = after_stat(density)), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "Pain Interference Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#2-2)pain interference - Association
pi.nonoud <- pi_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,pain_interf_sum)

pi.oud <- pi_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, pain_interf_sum)

var.test(pi.nonoud$pain_interf_sum, pi.oud$pain_interf_sum) # p-value = 0.7834 -> equal variance

    F test to compare two variances

data:  pi.nonoud$pain_interf_sum and pi.oud$pain_interf_sum
F = 0.9773, num df = 826, denom df = 390, p-value = 0.7834
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.8219306 1.1559193
sample estimates:
ratio of variances 
         0.9773002 
pi_ttest <- t.test(pi.nonoud$pain_interf_sum, pi.oud$pain_interf_sum, var.equal = T)
print(pi_ttest) #statistically different.

    Two Sample t-test

data:  pi.nonoud$pain_interf_sum and pi.oud$pain_interf_sum
t = 2.168, df = 1216, p-value = 0.03035
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.194941 3.906501
sample estimates:
mean of x mean of y 
 33.58525  31.53453 
#3-1) pain catastrophizing - Distribution
pc_data_noNA <- final_data %>%
  filter(!is.na(cohort), !is.na(ca_sum))

ggplot(pc_data_noNA, aes(x = as.factor(cohort), y = ca_sum, fill = as.factor(cohort))) +
    geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(x = "Cohort", y = "Pain Catastrophizing Sum", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
   geom_jitter(height = 0, width = 0.1) +
    theme_minimal()

ggplot(pc_data_noNA, aes(x = ca_sum, fill = as.factor(cohort))) +
    geom_histogram(aes(y = after_stat(density)), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "Pain Catastrophizing", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#3-2) pain catastrophizing - Association
pc.nonoud <- pc_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,ca_sum)

pc.oud <- pc_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, ca_sum)

var.test(pc.nonoud$ca_sum, pc.oud$ca_sum) #p-value = 0.006322 -> unequal variance -> Welch's t-test

    F test to compare two variances

data:  pc.nonoud$ca_sum and pc.oud$ca_sum
F = 1.2806, num df = 836, denom df = 366, p-value = 0.006322
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.072940 1.519366
sample estimates:
ratio of variances 
           1.28062 
pc_ttest <- t.test(pc.nonoud$ca_sum, pc.oud$ca_sum)
print(pc_ttest) #statistically different.

    Welch Two Sample t-test

data:  pc.nonoud$ca_sum and pc.oud$ca_sum
t = -7.9363, df = 784.18, p-value = 7.194e-15
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.699746 -2.835854
sample estimates:
mean of x mean of y 
 14.26762  18.03542 
#4-1)PSPS - Mental defeat - Distribution
psps_data_noNA <- final_data %>%
  filter(!is.na(pps_sum))

ggplot(psps_data_noNA, aes(x = as.factor(cohort), y = pps_sum, fill = as.factor(cohort))) +
    geom_violin(fill = "lightyellow", draw_quantiles = c(0.25, 0.5, 0.75)) +
    labs(x = "Cohort", y = "Mental defeat", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
   geom_jitter(height = 0, width = 0.1) +
    theme_minimal()

ggplot(psps_data_noNA, aes(x = pps_sum, fill = as.factor(cohort))) +
    geom_histogram(aes(y = after_stat(density)), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "Mental Defeat", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#4-2) PSPS - Mental defeat - Association
md.nonoud <- psps_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,pps_sum)

md.oud <- psps_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, pps_sum)

var.test(md.nonoud$pps_sum, md.oud$pps_sum) #p-value = 0.3163 -> equal variance -> pooled (Student's) t-test

    F test to compare two variances

data:  md.nonoud$pps_sum and md.oud$pps_sum
F = 1.1209, num df = 582, denom df = 225, p-value = 0.3163
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.896521 1.386564
sample estimates:
ratio of variances 
          1.120889 
md_ttest <- t.test(md.nonoud$pps_sum, md.oud$pps_sum, var.equal=T)
print(md_ttest) #statistically different.

    Two Sample t-test

data:  md.nonoud$pps_sum and md.oud$pps_sum
t = -6.6498, df = 807, p-value = 5.407e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -18.371066  -9.997216
sample estimates:
mean of x mean of y 
 31.39108  45.57522 
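The decision rule applied throughout this section — run an F test of variances, then choose Student's (pooled) or Welch's t-test accordingly — can be wrapped in a small helper. This is an illustrative sketch on simulated data; the function name and alpha threshold are assumptions, not part of the analysis above.

```r
# Hedged sketch: F test of variances decides between Student's and Welch's t-test.
compare_groups <- function(x, y, alpha = 0.05) {
  # Fail to reject equal variances -> pooled (Student's) t-test; otherwise Welch's
  equal_var <- var.test(x, y)$p.value > alpha
  t.test(x, y, var.equal = equal_var)
}

set.seed(42)
res <- compare_groups(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 12, sd = 2))
res$method   # which t-test was chosen
res$p.value  # group difference
```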
#5-1) DSSI - Distribution (SIDSSI, SSDSSI, ISDSSI (all continuous))
dssi_data_noNA <- final_data %>%
  filter(!is.na(SIDSSI), !is.na(SSDSSI), !is.na(ISDSSI))

ggplot(dssi_data_noNA, aes(x = SIDSSI, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "SIDSSI", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

ggplot(dssi_data_noNA, aes(x = SSDSSI, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "SSDSSI", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

ggplot(dssi_data_noNA, aes(x = ISDSSI, fill = as.factor(cohort))) +
    geom_histogram(aes(y = ..density..), bins = 20, alpha = 0.5, position = "identity") +
    geom_density(alpha = 0.3) +
    facet_wrap(~ cohort, labeller = as_labeller(c("0" = "Control", "1" = "Case"))) +
    labs(title = "Histogram with Density Curve", x = "ISDSSI", fill = "Cohort") +
    scale_fill_discrete(labels = c("Control", "Case")) +
    theme_minimal()

#5-2) DSSI - Association
#5-2-1) SIDSSI
sidssi.nonoud <- dssi_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,SIDSSI)

sidssi.oud <- dssi_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, SIDSSI)

var.test(sidssi.nonoud$SIDSSI, sidssi.oud$SIDSSI) #p-value = 0.5488 -> equal variance -> Student's t-test

    F test to compare two variances

data:  sidssi.nonoud$SIDSSI and sidssi.oud$SIDSSI
F = 0.94976, num df = 838, denom df = 375, p-value = 0.5488
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.7970469 1.1252390
sample estimates:
ratio of variances 
         0.9497572 
sidssi_ttest <- t.test(sidssi.nonoud$SIDSSI, sidssi.oud$SIDSSI, var.equal=T)
print(sidssi_ttest) #statistically NOT different.

    Two Sample t-test

data:  sidssi.nonoud$SIDSSI and sidssi.oud$SIDSSI
t = 0.2859, df = 1213, p-value = 0.775
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.2432314  0.3262139
sample estimates:
mean of x mean of y 
 5.193087  5.151596 
#5-2-2) SSDSSI
ssdssi.nonoud <- dssi_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,SSDSSI)

ssdssi.oud <- dssi_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, SSDSSI)

var.test(ssdssi.nonoud$SSDSSI, ssdssi.oud$SSDSSI) #p-value = 0.006448 -> unequal variance -> Welch's t-test

    F test to compare two variances

data:  ssdssi.nonoud$SSDSSI and ssdssi.oud$SSDSSI
F = 0.79046, num df = 838, denom df = 375, p-value = 0.006448
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6633604 0.9365058
sample estimates:
ratio of variances 
         0.7904571 
ssdssi_ttest <- t.test(ssdssi.nonoud$SSDSSI, ssdssi.oud$SSDSSI)
print(ssdssi_ttest) #statistically different.

    Welch Two Sample t-test

data:  ssdssi.nonoud$SSDSSI and ssdssi.oud$SSDSSI
t = 9.7284, df = 651.18, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 1.724496 2.596701
sample estimates:
mean of x mean of y 
 17.94517  15.78457 
#5-2-3) ISDSSI
isdssi.nonoud <- dssi_data_noNA %>%
  filter(cohort==0)%>%
  select(SubjID,ISDSSI)

isdssi.oud <- dssi_data_noNA %>%
  filter(cohort==1) %>%
  select(SubjID, ISDSSI)

var.test(isdssi.nonoud$ISDSSI, isdssi.oud$ISDSSI) #p-value = 3.597e-06 -> unequal variance -> Welch's t-test

    F test to compare two variances

data:  isdssi.nonoud$ISDSSI and isdssi.oud$ISDSSI
F = 0.67218, num df = 838, denom df = 375, p-value = 3.597e-06
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.5641025 0.7963774
sample estimates:
ratio of variances 
         0.6721818 
isdssi_ttest <- t.test(isdssi.nonoud$ISDSSI, isdssi.oud$ISDSSI)
print(isdssi_ttest) #statistically different.

    Welch Two Sample t-test

data:  isdssi.nonoud$ISDSSI and isdssi.oud$ISDSSI
t = 3.294, df = 610.18, p-value = 0.001045
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.2609288 1.0314215
sample estimates:
mean of x mean of y 
 8.702026  8.055851 

2.4.1 Descriptive Analysis

Sample characteristics

Men are overrepresented in the OUD group, and individuals with OUD are generally younger. Being married or partnered is more common among non-OUD individuals, who also tend to have higher levels of education. Financial stability is less common among individuals with OUD.

library(Hmisc)

Attaching package: 'Hmisc'
The following object is masked from 'package:parsnip':

    translate
The following objects are masked from 'package:dplyr':

    src, summarize
The following objects are masked from 'package:base':

    format.pval, units
library(table1)

Attaching package: 'table1'
The following objects are masked from 'package:Hmisc':

    label, label<-, units
The following objects are masked from 'package:base':

    units, units<-
# Preprocessing: Filter out rows where "Refused to answer" is present
final_data_clean <- final_data[!(final_data$DEM002 == "(X)Subject refused to answer" | 
                                 final_data$DEM004 == "(X)Subject refused to answer"), ]
final_data_clean <- droplevels(final_data_clean)

# Relabel the cohort variable
final_data_clean$cohort <- factor(final_data_clean$cohort, levels = c(0, 1), labels = c("Non-OUD", "OUD"))

# Assigning labels to variables for better readability
label(final_data_clean$DEM002) <- "Gender"
label(final_data_clean$DEM003) <- "Age (years)"
label(final_data_clean$DEM004) <- "Marital Status"
label(final_data_clean$DEM005) <- "Household"
label(final_data_clean$DEM006) <- "Education"
label(final_data_clean$DEM007) <- "Employment"
label(final_data_clean$DEM008) <- "Financial Status"

# Print demographic characteristics
cat("Demographic characteristics in non-OUD and OUD groups")
Demographic characteristics in non-OUD and OUD groups
# Creating the table
table1(~ DEM002 + DEM003 + DEM004 + DEM005 + DEM006 + DEM007 + DEM008 | cohort, 
       data = final_data_clean, rowlabelhead = "Demographics")
Demographics Non-OUD
(N=901)
OUD
(N=430)
Overall
(N=1331)
Gender
Male 299 (33.2%) 231 (53.7%) 530 (39.8%)
Female 602 (66.8%) 199 (46.3%) 801 (60.2%)
Age (years)
Mean (SD) 38.2 (12.7) 24.9 (11.1) 33.9 (13.7)
Median [Min, Max] 39.0 [2.00, 69.0] 22.0 [5.00, 57.0] 35.0 [2.00, 69.0]
Marital Status
Married/Partnered 539 (59.8%) 98 (22.8%) 637 (47.9%)
Separated 28 (3.1%) 33 (7.7%) 61 (4.6%)
Divorced 154 (17.1%) 77 (17.9%) 231 (17.4%)
Never married 145 (16.1%) 210 (48.8%) 355 (26.7%)
Widowed 35 (3.9%) 12 (2.8%) 47 (3.5%)
Household
Live alone 149 (16.5%) 80 (18.6%) 229 (17.2%)
One other person 407 (45.2%) 116 (27.0%) 523 (39.3%)
More than one other person 345 (38.3%) 234 (54.4%) 579 (43.5%)
Education
Grade 6 or less 3 (0.3%) 2 (0.5%) 5 (0.4%)
Grade 7 to 12 49 (5.4%) 77 (17.9%) 126 (9.5%)
Graduated high school or high school equivalent 183 (20.3%) 167 (38.8%) 350 (26.3%)
Part college 233 (25.9%) 103 (24.0%) 336 (25.2%)
Graduated 2 years college 120 (13.3%) 29 (6.7%) 149 (11.2%)
Graduated 4 years college 145 (16.1%) 28 (6.5%) 173 (13.0%)
Part graduate/professional school 44 (4.9%) 7 (1.6%) 51 (3.8%)
Completed graduate/professional school 124 (13.8%) 17 (4.0%) 141 (10.6%)
Employment
Yes 230 (25.5%) 120 (27.9%) 350 (26.3%)
No 671 (74.5%) 310 (72.1%) 981 (73.7%)
Financial Status
Ca not make ends meet 176 (19.5%) 216 (50.2%) 392 (29.5%)
Have just enough to get along 416 (46.2%) 179 (41.6%) 595 (44.7%)
Are comfortable 309 (34.3%) 35 (8.1%) 344 (25.8%)

Predictors - descriptive analysis (table not generated because of missing values in the predictors)

# Relabel the cohort variable
final_data_clean$cohort <- factor(final_data_clean$cohort, levels = c(0, 1), labels = c("Non-OUD", "OUD"))

# Assigning labels to variables for better readability
label(final_data_clean$pain_severity_sum) <- "Pain severity"
label(final_data_clean$pain_interf_sum) <-  "Pain interference"
label(final_data_clean$dependence_level) <- "Nicotine dependence"
label(final_data_clean$Anxiety) <- "Anxiety"
label(final_data_clean$Depression) <- "Depression"
label(final_data_clean$ca_sum) <- "Pain Catastrophizing"
label(final_data_clean$pps_sum) <- "Mental defeat"
label(final_data_clean$suicidality_risk) <- "Suicidality"
label(final_data_clean$SIDSSI) <- "Social interaction"
label(final_data_clean$SSDSSI) <- "Satisfaction with social support"
label(final_data_clean$ISDSSI) <- "Social support"

# Print demographic characteristics
cat("Clinical and Psychosocial Characteristics Stratified by OUD Status")
Clinical and Psychosocial Characteristics Stratified by OUD Status
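When missingness prevents a formatted descriptive table, group-wise summaries can still be computed directly with dplyr by dropping NAs per variable. A minimal sketch on toy data standing in for `final_data_clean` (column names here are illustrative):

```r
# Hedged sketch: descriptive statistics by cohort despite missing values.
library(dplyr)

toy <- data.frame(
  cohort  = factor(c("Non-OUD", "OUD", "Non-OUD", "OUD", "Non-OUD")),
  pps_sum = c(31, 45, NA, 48, 28)   # mental defeat score with one missing value
)

md_summary <- toy %>%
  group_by(cohort) %>%
  summarise(
    n_obs   = sum(!is.na(pps_sum)),
    mean_md = mean(pps_sum, na.rm = TRUE),  # drop NAs rather than fail
    sd_md   = sd(pps_sum, na.rm = TRUE)
  )
md_summary
```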

2.5 Developing ML models

Given the skewed distributions of several predictors, three algorithms known to handle such data well were selected: Logistic Regression, Random Forest, and XGBoost.

2.5.1 Missing data Imputation

Before developing the models, missing data must be addressed. For this, random forest imputation (the rfImpute function from the randomForest package) was used. This method accounts for the distribution of each variable rather than relying on the mean, median, or mode: for continuous predictors, the imputed value is the proximity-weighted average of the non-missing observations; for categorical predictors, it is the category with the highest average proximity. While most variables had less than 10% missing data, the mental defeat score (pps_sum) and suicidality risk exhibited substantial missingness. Because these variables are clinically important factors associated with the outcome (OUD), they were nevertheless included in the prediction model.

To evaluate whether these factors can be reliably incorporated and whether the imputation was performed appropriately, I will compare models with and without these variables. This analysis will be discussed in the section titled “Model Selection Despite Missing Data Imputation.”
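The proximity-based imputation described above can be demonstrated on a small simulated dataset before applying it to the study data. This is a self-contained sketch; the variable names and sample sizes are illustrative only.

```r
# Hedged sketch: rfImpute fills NAs using random-forest proximities.
library(randomForest)

set.seed(1)
toy <- data.frame(
  outcome = factor(rep(c(0, 1), each = 25)),
  x1 = c(rnorm(25, mean = 0), rnorm(25, mean = 2)),
  x2 = c(rnorm(25, mean = 5), rnorm(25, mean = 7))
)
toy$x1[c(3, 30)] <- NA                       # introduce missingness

# Imputed values are proximity-weighted averages of non-missing observations
imputed <- rfImpute(outcome ~ ., data = toy, iter = 3, ntree = 100)
sum(is.na(imputed))                          # all NAs filled in
```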

library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':

    combine
The following object is masked from 'package:ggplot2':

    margin
##Missing data count when merged
colSums(is.na(final_data))
           SubjID            cohort            DEM002            DEM003 
                0                 0                 0                 0 
           DEM004            DEM005            DEM006            DEM007 
                0                 0                 0                 0 
           DEM008 pain_severity_sum   pain_interf_sum  dependence_level 
                0                55               113                58 
          Anxiety        Depression            ca_sum           pps_sum 
               70                70               127               522 
 suicidality_risk           alcohol            SIDSSI            SSDSSI 
              606                77               100               114 
           ISDSSI 
              103 
# Missing data imputation, using randomforest package
str(final_data)
'data.frame':   1331 obs. of  21 variables:
 $ SubjID           : num  1000 1001 1002 1003 1004 ...
 $ cohort           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ DEM002           : Factor w/ 3 levels "(X)Subject refused to answer",..: 3 3 2 3 3 3 2 2 3 2 ...
  ..- attr(*, "label")= chr "What is your gender"
 $ DEM003           : num  35 32 28 42 27 39 24 42 39 39 ...
 $ DEM004           : Factor w/ 6 levels "(X)Subject refused to answer",..: 2 2 4 2 2 2 5 2 4 2 ...
  ..- attr(*, "label")= chr "Current marital status"
 $ DEM005           : Factor w/ 3 levels "Live alone","One other person",..: 1 2 1 2 2 2 1 2 1 3 ...
 $ DEM006           : Factor w/ 8 levels "Grade 6 or less",..: 1 4 3 5 6 7 6 2 5 3 ...
 $ DEM007           : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 2 1 2 ...
 $ DEM008           : Factor w/ 3 levels "Ca not make ends meet",..: 1 2 1 2 3 2 2 3 3 1 ...
 $ pain_severity_sum: num  NA 32 25 12 13 20 22 23 22 25 ...
 $ pain_interf_sum  : num  NA 48 50 3 15 39 54 14 55 35 ...
 $ dependence_level : chr  "Moderate" "Non-smoker" "Non-smoker" "Non-smoker" ...
 $ Anxiety          : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 2 1 ...
 $ Depression       : Factor w/ 2 levels "0","1": 2 2 2 1 2 2 2 1 2 1 ...
 $ ca_sum           : num  22 9 18 5 12 16 18 0 30 21 ...
 $ pps_sum          : num  NA NA NA 10 44 NA 26 NA NA NA ...
 $ suicidality_risk : chr  NA NA NA NA ...
 $ alcohol          : num  0 NA NA NA NA NA 1 0 1 0 ...
 $ SIDSSI           : num  NA 5 4 10 3 4 5 9 5 9 ...
  ..- attr(*, "label")= chr "Total score for Social Interaction"
  ..- attr(*, "format.spss")= chr "F8.2"
 $ SSDSSI           : num  NA 13 9 21 21 17 15 21 10 21 ...
  ..- attr(*, "label")= chr "Total score for Subjective Support"
  ..- attr(*, "format.spss")= chr "F8.2"
 $ ISDSSI           : num  NA 8 10 6 12 9 10 9 0 11 ...
  ..- attr(*, "label")= chr "Total score for Instrumental Support"
  ..- attr(*, "format.spss")= chr "F8.2"
# Using rfImpute to impute missing values in predictor variables
final_data <- final_data %>%
  mutate(across(where(is.character), as.factor))
imputed_data <- rfImpute(cohort ~ ., data = final_data, iter = 5, ntree = 500)
ntree      OOB      1      2
  500:  13.52%  6.66% 27.91%
ntree      OOB      1      2
  500:  13.22%  6.10% 28.14%
ntree      OOB      1      2
  500:  13.22%  5.77% 28.84%
ntree      OOB      1      2
  500:  13.30%  6.22% 28.14%
ntree      OOB      1      2
  500:  13.45%  6.44% 28.14%
imputed_data <- imputed_data %>% 
  select(-SubjID)

head(imputed_data) 
  cohort DEM002 DEM003            DEM004           DEM005
1      0 Female     35 Married/Partnered       Live alone
2      0 Female     32 Married/Partnered One other person
3      0   Male     28          Divorced       Live alone
4      0 Female     42 Married/Partnered One other person
5      0 Female     27 Married/Partnered One other person
6      0 Female     39 Married/Partnered One other person
                                           DEM006 DEM007
1                                 Grade 6 or less    Yes
2                                    Part college     No
3 Graduated high school or high school equivalent     No
4                       Graduated 2 years college     No
5                       Graduated 4 years college    Yes
6               Part graduate/professional school     No
                         DEM008 pain_severity_sum pain_interf_sum
1         Ca not make ends meet          22.77192           36.78
2 Have just enough to get along          32.00000           48.00
3         Ca not make ends meet          25.00000           50.00
4 Have just enough to get along          12.00000            3.00
5               Are comfortable          13.00000           15.00
6 Have just enough to get along          20.00000           39.00
  dependence_level Anxiety Depression ca_sum  pps_sum suicidality_risk
1         Moderate       1          1     22 33.00521         Low Risk
2       Non-smoker       1          1      9 33.80834         Low Risk
3       Non-smoker       1          1     18 37.47518         Low Risk
4       Non-smoker       1          0      5 10.00000         Low Risk
5       Non-smoker       1          1     12 44.00000         Low Risk
6       Non-smoker       1          1     16 30.35042         Low Risk
    alcohol    SIDSSI   SSDSSI    ISDSSI
1 0.0000000  5.135818 18.11672  8.897234
2 0.4496675  5.000000 13.00000  8.000000
3 0.3850812  4.000000  9.00000 10.000000
4 0.4425485 10.000000 21.00000  6.000000
5 0.5076271  3.000000 21.00000 12.000000
6 0.4504239  4.000000 17.00000  9.000000
colSums(is.na(imputed_data))
           cohort            DEM002            DEM003            DEM004 
                0                 0                 0                 0 
           DEM005            DEM006            DEM007            DEM008 
                0                 0                 0                 0 
pain_severity_sum   pain_interf_sum  dependence_level           Anxiety 
                0                 0                 0                 0 
       Depression            ca_sum           pps_sum  suicidality_risk 
                0                 0                 0                 0 
          alcohol            SIDSSI            SSDSSI            ISDSSI 
                0                 0                 0                 0 
summary(imputed_data)
 cohort                           DEM002        DEM003     
 0:901   (X)Subject refused to answer:  0   Min.   : 2.00  
 1:430   Male                        :530   1st Qu.:23.00  
         Female                      :801   Median :35.00  
                                            Mean   :33.92  
                                            3rd Qu.:44.00  
                                            Max.   :69.00  
                                                           
                          DEM004                           DEM005   
 (X)Subject refused to answer:  0   Live alone                :229  
 Married/Partnered           :637   One other person          :523  
 Separated                   : 61   More than one other person:579  
 Divorced                    :231                                   
 Never married               :355                                   
 Widowed                     : 47                                   
                                                                    
                                             DEM006    DEM007   
 Graduated high school or high school equivalent:350   Yes:350  
 Part college                                   :336   No :981  
 Graduated 4 years college                      :173            
 Graduated 2 years college                      :149            
 Completed graduate/professional school         :141            
 Grade 7 to 12                                  :126            
 (Other)                                        : 56            
                           DEM008    pain_severity_sum pain_interf_sum
 Ca not make ends meet        :392   Min.   : 0.00     Min.   : 0.00  
 Have just enough to get along:595   1st Qu.:16.55     1st Qu.:23.00  
 Are comfortable              :344   Median :21.00     Median :34.00  
                                     Mean   :20.87     Mean   :32.76  
                                     3rd Qu.:26.00     3rd Qu.:44.00  
                                     Max.   :40.00     Max.   :60.00  
                                                                      
        dependence_level Anxiety Depression     ca_sum         pps_sum     
 High           : 69     0:652   0:708      Min.   : 0.00   Min.   : 0.00  
 Low            :107     1:679   1:623      1st Qu.:10.00   1st Qu.:17.41  
 Low to Moderate:122                        Median :15.00   Median :30.37  
 Moderate       :196                        Mean   :15.42   Mean   :35.32  
 Non-smoker     :837                        3rd Qu.:21.00   3rd Qu.:53.08  
                                            Max.   :36.00   Max.   :96.00  
                                                                           
  suicidality_risk    alcohol           SIDSSI           SSDSSI     
 High Risk: 330    Min.   :0.0000   Min.   : 0.000   Min.   : 7.00  
 Low Risk :1001    1st Qu.:0.0000   1st Qu.: 4.000   1st Qu.:15.00  
                   Median :0.0000   Median : 5.000   Median :18.00  
                   Mean   :0.3911   Mean   : 5.158   Mean   :17.27  
                   3rd Qu.:1.0000   3rd Qu.: 7.000   3rd Qu.:20.00  
                   Max.   :1.0000   Max.   :13.000   Max.   :21.00  
                                                                    
     ISDSSI     
 Min.   : 0.00  
 1st Qu.: 7.00  
 Median : 9.00  
 Mean   : 8.48  
 3rd Qu.:11.00  
 Max.   :12.00  
                

2.5.2 Data Splitting and Cross-Validation

library(rsample)

set.seed (1234)
mydata_split <- initial_split(imputed_data,
                            strata = cohort,# maintaining same percentage of cases/controls
                            prop = 0.80)
mydata_training <- training(mydata_split)
mydata_test <- testing(mydata_split)

#10-fold cross validation: given the small sample size and the absence of parameter tuning, I will not create a separate held-out set and will use the whole dataset.
set.seed(1234)
mydata_folds <-vfold_cv(imputed_data, v=10) #okay to use the whole data, without held out data
mydata_folds
#  10-fold cross-validation 
# A tibble: 10 × 2
   splits             id    
   <list>             <chr> 
 1 <split [1197/134]> Fold01
 2 <split [1198/133]> Fold02
 3 <split [1198/133]> Fold03
 4 <split [1198/133]> Fold04
 5 <split [1198/133]> Fold05
 6 <split [1198/133]> Fold06
 7 <split [1198/133]> Fold07
 8 <split [1198/133]> Fold08
 9 <split [1198/133]> Fold09
10 <split [1198/133]> Fold10

2.5.3. Model Selection Despite Missing Data Imputation

To determine whether the imputation was performed properly without unduly changing model performance, I will compare two models: one including all variables and one excluding the two variables with substantial missingness. I could also have compared against a complete-case model, but because the sample sizes of the two models would then differ too much, I chose this comparison. Along with the AUC-ROC value, false negatives (which carry more clinical importance) will be considered, and I will select the model with the better false-negative performance. Significant predictors in the two models will also be compared.
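Counting false negatives from a confusion matrix — the clinically weighted criterion described above — is a one-line lookup. A self-contained sketch with toy predictions (coding 1 = OUD case, as in the study):

```r
# Hedged sketch: false negatives are OUD cases (truth = 1) predicted as non-OUD (pred = 0).
truth <- factor(c(1, 1, 1, 0, 0, 1, 0, 0), levels = c(0, 1))
pred  <- factor(c(1, 0, 1, 0, 0, 0, 0, 1), levels = c(0, 1))

cm <- table(Prediction = pred, Truth = truth)
fn <- cm["0", "1"]   # predicted 0 while truth is 1
fn                   # 2 false negatives in this toy example
```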

Model excluding the two variables with significant missingness

This model can be compared with the model including these two variables in 2.5.4.

imputed_data_drop <- imputed_data %>%
  select(-pps_sum, -suicidality_risk)

set.seed (1234)
mydata_split_drop <- initial_split(imputed_data_drop,
                            strata = cohort,#maintaining same percentage of cases/controls
                            prop = 0.80)
mydata_training_drop <- training(mydata_split_drop)
mydata_test_drop <- testing(mydata_split_drop)

mydata_glm_drop <- glm (cohort ~., data = mydata_training_drop, family = binomial(logit))
summary(mydata_glm_drop)

Call:
glm(formula = cohort ~ ., family = binomial(logit), data = mydata_training_drop)

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                            4.942656   1.499017
DEM002Female                                          -0.856955   0.201204
DEM003                                                -0.064206   0.009429
DEM004Separated                                        0.937199   0.430508
DEM004Divorced                                         0.836902   0.272311
DEM004Never married                                    1.010768   0.260428
DEM004Widowed                                          0.592886   0.669202
DEM005One other person                                 0.271899   0.305707
DEM005More than one other person                       0.521904   0.304877
DEM006Grade 7 to 12                                   -0.073755   1.181374
DEM006Graduated high school or high school equivalent -0.466416   1.153980
DEM006Part college                                    -0.911793   1.157818
DEM006Graduated 2 years college                       -1.415609   1.180843
DEM006Graduated 4 years college                       -1.039536   1.179874
DEM006Part graduate/professional school               -2.107400   1.357155
DEM006Completed graduate/professional school          -0.709468   1.198047
DEM007No                                              -0.234212   0.233332
DEM008Have just enough to get along                   -0.439306   0.216660
DEM008Are comfortable                                 -0.888200   0.330163
pain_severity_sum                                     -0.063186   0.017198
pain_interf_sum                                       -0.007501   0.009782
dependence_levelLow                                   -0.321873   0.470807
dependence_levelLow to Moderate                        0.499488   0.470429
dependence_levelModerate                               0.438853   0.436700
dependence_levelNon-smoker                            -1.415626   0.409711
Anxiety1                                              -0.101627   0.245515
Depression1                                            0.174846   0.248811
ca_sum                                                 0.044102   0.015764
alcohol                                               -0.707190   0.213333
SIDSSI                                                 0.145705   0.047275
SSDSSI                                                -0.070290   0.036055
ISDSSI                                                -0.067315   0.038020
                                                      z value Pr(>|z|)    
(Intercept)                                             3.297 0.000976 ***
DEM002Female                                           -4.259 2.05e-05 ***
DEM003                                                 -6.809 9.80e-12 ***
DEM004Separated                                         2.177 0.029483 *  
DEM004Divorced                                          3.073 0.002117 ** 
DEM004Never married                                     3.881 0.000104 ***
DEM004Widowed                                           0.886 0.375639    
DEM005One other person                                  0.889 0.373782    
DEM005More than one other person                        1.712 0.086924 .  
DEM006Grade 7 to 12                                    -0.062 0.950219    
DEM006Graduated high school or high school equivalent  -0.404 0.686080    
DEM006Part college                                     -0.788 0.430983    
DEM006Graduated 2 years college                        -1.199 0.230601    
DEM006Graduated 4 years college                        -0.881 0.378287    
DEM006Part graduate/professional school                -1.553 0.120469    
DEM006Completed graduate/professional school           -0.592 0.553725    
DEM007No                                               -1.004 0.315488    
DEM008Have just enough to get along                    -2.028 0.042598 *  
DEM008Are comfortable                                  -2.690 0.007141 ** 
pain_severity_sum                                      -3.674 0.000239 ***
pain_interf_sum                                        -0.767 0.443208    
dependence_levelLow                                    -0.684 0.494189    
dependence_levelLow to Moderate                         1.062 0.288339    
dependence_levelModerate                                1.005 0.314930    
dependence_levelNon-smoker                             -3.455 0.000550 ***
Anxiety1                                               -0.414 0.678922    
Depression1                                             0.703 0.482226    
ca_sum                                                  2.798 0.005149 ** 
alcohol                                                -3.315 0.000917 ***
SIDSSI                                                  3.082 0.002056 ** 
SSDSSI                                                 -1.950 0.051232 .  
ISDSSI                                                 -1.771 0.076644 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1339.23  on 1063  degrees of freedom
Residual deviance:  724.28  on 1032  degrees of freedom
AIC: 788.28

Number of Fisher Scoring iterations: 6
vif(mydata_glm_drop)
                      GVIF Df GVIF^(1/(2*Df))
DEM002            1.114282  1        1.055596
DEM003            1.416141  1        1.190017
DEM004            1.631771  4        1.063120
DEM005            1.451766  2        1.097676
DEM006            1.631024  7        1.035561
DEM007            1.207927  1        1.099057
DEM008            1.378104  2        1.083479
pain_severity_sum 1.831957  1        1.353498
pain_interf_sum   2.271775  1        1.507241
dependence_level  1.311228  4        1.034451
Anxiety           1.632921  1        1.277858
Depression        1.727579  1        1.314374
ca_sum            1.663924  1        1.289932
alcohol           1.100878  1        1.049227
SIDSSI            1.246200  1        1.116333
SSDSSI            1.815002  1        1.347220
ISDSSI            1.442273  1        1.200947
#Model building
lr_cls_spec_2 <- 
  logistic_reg() |> 
  set_engine("glm")

#Model fit 
lr_cls_fit_2 <- 
  lr_cls_spec_2 |>
  fit(cohort ~ ., data = mydata_training_drop)

#Prediction using the "TEST" dataset
mydata.lr.pred.values.test_2 <-  bind_cols(
  truth = mydata_test_drop$cohort,
  predict(lr_cls_fit_2, mydata_test_drop),
  predict(lr_cls_fit_2, mydata_test_drop, type = "prob"))

mydata.lr.pred.values.test_2
# A tibble: 267 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     0             0.963 0.0371 
 2 0     0             0.919 0.0815 
 3 0     0             0.616 0.384  
 4 0     0             0.942 0.0581 
 5 0     0             0.947 0.0531 
 6 0     0             0.874 0.126  
 7 0     0             0.958 0.0424 
 8 1     1             0.405 0.595  
 9 0     0             0.991 0.00858
10 0     0             0.747 0.253  
# ℹ 257 more rows
#ROC plots for testing data
autoplot(roc_curve(mydata.lr.pred.values.test_2, 
                   truth, 
                   .pred_0))

#Metrics of prediction on test data
metrics(mydata.lr.pred.values.test_2, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.813
2 kap         binary         0.560
3 mn_log_loss binary         0.401
4 roc_auc     binary         0.880
conf_matrix_without <- mydata.lr.pred.values.test_2 %>%
  conf_mat(truth = truth, estimate = .pred_class)

# Print the confusion matrix
print(conf_matrix_without)
          Truth
Prediction   0   1
         0 160  29
         1  21  57

Result: Given the similarity in model performance, the model that includes the two variables is preferable, as these variables hold clinical significance. More importantly, the imputation process does not appear to negatively affect the model's performance. In particular, the model with the two variables showed a slightly higher AUC-ROC (0.88655 vs. 0.88596) and fewer false negatives (n = 28 vs. n = 29 for the exclusion model). Clinically, minimizing false negatives is crucial in a predictive model. Therefore, further development of the algorithm will proceed using the imputed data with these two variables included.

2.5.4. Logistic regression

glm (Training/Test)

#General model, fit on the whole dataset
forwholeglm <- final_data %>% 
  select(-SubjID)
whole_glm <- glm(cohort ~., data = forwholeglm, family=binomial(logit))
summary(whole_glm)

Call:
glm(formula = cohort ~ ., family = binomial(logit), data = forwholeglm)

Coefficients:
                                                        Estimate Std. Error
(Intercept)                                            6.305e+00  2.434e+00
DEM002Female                                          -8.908e-01  2.561e-01
DEM003                                                -6.194e-02  1.246e-02
DEM004Separated                                        5.731e-01  6.136e-01
DEM004Divorced                                         3.985e-02  3.799e-01
DEM004Never married                                    4.745e-01  3.256e-01
DEM004Widowed                                         -2.376e-01  7.551e-01
DEM005One other person                                -4.703e-01  4.046e-01
DEM005More than one other person                      -2.187e-01  4.111e-01
DEM006Grade 7 to 12                                    1.349e-01  2.160e+00
DEM006Graduated high school or high school equivalent -5.028e-02  2.139e+00
DEM006Part college                                    -4.823e-01  2.138e+00
DEM006Graduated 2 years college                       -8.433e-01  2.155e+00
DEM006Graduated 4 years college                        1.376e-01  2.149e+00
DEM006Part graduate/professional school               -6.214e-01  2.266e+00
DEM006Completed graduate/professional school          -2.230e-01  2.160e+00
DEM007No                                               1.324e-01  3.052e-01
DEM008Have just enough to get along                   -9.866e-02  2.796e-01
DEM008Are comfortable                                 -1.135e+00  4.317e-01
pain_severity_sum                                     -2.457e-02  2.214e-02
pain_interf_sum                                       -2.181e-02  1.265e-02
dependence_levelLow                                   -7.406e-01  5.965e-01
dependence_levelLow to Moderate                        1.912e-01  5.847e-01
dependence_levelModerate                               3.932e-01  5.426e-01
dependence_levelNon-smoker                            -1.420e+00  5.005e-01
Anxiety1                                               2.331e-01  3.254e-01
Depression1                                            1.413e-01  3.393e-01
ca_sum                                                 3.054e-02  2.403e-02
pps_sum                                               -9.971e-05  6.625e-03
suicidality_riskLow Risk                              -6.244e-01  3.303e-01
alcohol                                               -6.094e-01  2.638e-01
SIDSSI                                                 1.692e-01  5.788e-02
SSDSSI                                                -1.355e-01  4.488e-02
ISDSSI                                                -5.281e-02  4.639e-02
                                                      z value Pr(>|z|)    
(Intercept)                                             2.590 0.009601 ** 
DEM002Female                                           -3.479 0.000504 ***
DEM003                                                 -4.971 6.66e-07 ***
DEM004Separated                                         0.934 0.350314    
DEM004Divorced                                          0.105 0.916452    
DEM004Never married                                     1.457 0.144996    
DEM004Widowed                                          -0.315 0.752988    
DEM005One other person                                 -1.162 0.245090    
DEM005More than one other person                       -0.532 0.594660    
DEM006Grade 7 to 12                                     0.062 0.950198    
DEM006Graduated high school or high school equivalent  -0.024 0.981249    
DEM006Part college                                     -0.226 0.821511    
DEM006Graduated 2 years college                        -0.391 0.695617    
DEM006Graduated 4 years college                         0.064 0.948936    
DEM006Part graduate/professional school                -0.274 0.783887    
DEM006Completed graduate/professional school           -0.103 0.917754    
DEM007No                                                0.434 0.664494    
DEM008Have just enough to get along                    -0.353 0.724211    
DEM008Are comfortable                                  -2.628 0.008584 ** 
pain_severity_sum                                      -1.110 0.267008    
pain_interf_sum                                        -1.724 0.084781 .  
dependence_levelLow                                    -1.242 0.214378    
dependence_levelLow to Moderate                         0.327 0.743639    
dependence_levelModerate                                0.725 0.468598    
dependence_levelNon-smoker                             -2.837 0.004547 ** 
Anxiety1                                                0.716 0.473818    
Depression1                                             0.416 0.677142    
ca_sum                                                  1.271 0.203667    
pps_sum                                                -0.015 0.987992    
suicidality_riskLow Risk                               -1.890 0.058735 .  
alcohol                                                -2.310 0.020901 *  
SIDSSI                                                  2.924 0.003460 ** 
SSDSSI                                                 -3.020 0.002530 ** 
ISDSSI                                                 -1.138 0.254915    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 785.04  on 607  degrees of freedom
Residual deviance: 445.80  on 574  degrees of freedom
  (723 observations deleted due to missingness)
AIC: 513.8

Number of Fisher Scoring iterations: 6
vif(whole_glm)
                      GVIF Df GVIF^(1/(2*Df))
DEM002            1.126734  1        1.061477
DEM003            1.615394  1        1.270981
DEM004            2.071490  4        1.095306
DEM005            1.634523  2        1.130701
DEM006            1.912659  7        1.047411
DEM007            1.283880  1        1.133084
DEM008            1.464056  2        1.099992
pain_severity_sum 1.850743  1        1.360420
pain_interf_sum   2.499721  1        1.581051
dependence_level  1.511706  4        1.053012
Anxiety           1.770504  1        1.330603
Depression        2.010602  1        1.417957
ca_sum            2.334839  1        1.528018
pps_sum           2.329359  1        1.526224
suicidality_risk  1.333993  1        1.154986
alcohol           1.167542  1        1.080528
SIDSSI            1.317627  1        1.147879
SSDSSI            1.745899  1        1.321325
ISDSSI            1.477109  1        1.215364
mydata_glm <- glm(cohort ~ ., data = mydata_training, family = binomial(logit))
summary(mydata_glm) #sig: DEM002, DEM003, DEM004, DEM008, pain_severity_sum, dependence_level (Non-smoker), suicidality_risk, alcohol, SIDSSI

Call:
glm(formula = cohort ~ ., family = binomial(logit), data = mydata_training)

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                            5.255608   1.560909
DEM002Female                                          -0.885795   0.203752
DEM003                                                -0.061849   0.009592
DEM004Separated                                        0.793372   0.444420
DEM004Divorced                                         0.734701   0.275477
DEM004Never married                                    0.940694   0.264895
DEM004Widowed                                          0.378506   0.710802
DEM005One other person                                 0.236167   0.310174
DEM005More than one other person                       0.438869   0.310182
DEM006Grade 7 to 12                                   -0.310698   1.236864
DEM006Graduated high school or high school equivalent -0.605735   1.208196
DEM006Part college                                    -1.131702   1.213205
DEM006Graduated 2 years college                       -1.485880   1.234504
DEM006Graduated 4 years college                       -1.314164   1.235440
DEM006Part graduate/professional school               -2.114274   1.407303
DEM006Completed graduate/professional school          -0.908538   1.247878
DEM007No                                              -0.273371   0.237484
DEM008Have just enough to get along                   -0.326568   0.222831
DEM008Are comfortable                                 -0.738880   0.338253
pain_severity_sum                                     -0.056056   0.017495
pain_interf_sum                                       -0.009014   0.010112
dependence_levelLow                                   -0.429561   0.477173
dependence_levelLow to Moderate                        0.489151   0.478873
dependence_levelModerate                               0.416688   0.444427
dependence_levelNon-smoker                            -1.323109   0.415861
Anxiety1                                              -0.167235   0.252078
Depression1                                            0.117469   0.259111
ca_sum                                                 0.023654   0.017209
pps_sum                                                0.008114   0.005921
suicidality_riskLow Risk                              -0.765620   0.251746
alcohol                                               -0.739587   0.216871
SIDSSI                                                 0.158759   0.048217
SSDSSI                                                -0.051900   0.036759
ISDSSI                                                -0.066384   0.038358
                                                      z value Pr(>|z|)    
(Intercept)                                             3.367 0.000760 ***
DEM002Female                                           -4.347 1.38e-05 ***
DEM003                                                 -6.448 1.14e-10 ***
DEM004Separated                                         1.785 0.074231 .  
DEM004Divorced                                          2.667 0.007653 ** 
DEM004Never married                                     3.551 0.000383 ***
DEM004Widowed                                           0.533 0.594376    
DEM005One other person                                  0.761 0.446418    
DEM005More than one other person                        1.415 0.157105    
DEM006Grade 7 to 12                                    -0.251 0.801661    
DEM006Graduated high school or high school equivalent  -0.501 0.616121    
DEM006Part college                                     -0.933 0.350913    
DEM006Graduated 2 years college                        -1.204 0.228734    
DEM006Graduated 4 years college                        -1.064 0.287455    
DEM006Part graduate/professional school                -1.502 0.133005    
DEM006Completed graduate/professional school           -0.728 0.466573    
DEM007No                                               -1.151 0.249685    
DEM008Have just enough to get along                    -1.466 0.142775    
DEM008Are comfortable                                  -2.184 0.028933 *  
pain_severity_sum                                      -3.204 0.001355 ** 
pain_interf_sum                                        -0.891 0.372722    
dependence_levelLow                                    -0.900 0.368002    
dependence_levelLow to Moderate                         1.021 0.307035    
dependence_levelModerate                                0.938 0.348457    
dependence_levelNon-smoker                             -3.182 0.001465 ** 
Anxiety1                                               -0.663 0.507058    
Depression1                                             0.453 0.650294    
ca_sum                                                  1.375 0.169270    
pps_sum                                                 1.370 0.170531    
suicidality_riskLow Risk                               -3.041 0.002356 ** 
alcohol                                                -3.410 0.000649 ***
SIDSSI                                                  3.293 0.000993 ***
SSDSSI                                                 -1.412 0.157974    
ISDSSI                                                 -1.731 0.083515 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1339.23  on 1063  degrees of freedom
Residual deviance:  708.73  on 1030  degrees of freedom
AIC: 776.73

Number of Fisher Scoring iterations: 6
vif(mydata_glm) #All VIF values are less than 5; no significant collinearity was found among the variables
                      GVIF Df GVIF^(1/(2*Df))
DEM002            1.115358  1        1.056105
DEM003            1.439704  1        1.199876
DEM004            1.654341  4        1.064947
DEM005            1.460918  2        1.099402
DEM006            1.694028  7        1.038368
DEM007            1.220643  1        1.104827
DEM008            1.431904  2        1.093902
pain_severity_sum 1.845443  1        1.358471
pain_interf_sum   2.373240  1        1.540533
dependence_level  1.362805  4        1.039451
Anxiety           1.683350  1        1.297440
Depression        1.825862  1        1.351244
ca_sum            1.953333  1        1.397617
pps_sum           2.245337  1        1.498445
suicidality_risk  1.476975  1        1.215309
alcohol           1.114609  1        1.055751
SIDSSI            1.273091  1        1.128313
SSDSSI            1.849462  1        1.359949
ISDSSI            1.459411  1        1.208061
#Model building
lr_cls_spec <- 
  logistic_reg() |> 
  set_engine("glm")

#Model fit to training dataset
lr_cls_fit <- 
  lr_cls_spec |>
  fit(cohort ~ ., data = mydata_training)

#Prediction using the "testing" dataset
mydata.lr.pred.values.test <-  bind_cols(
  truth = mydata_test$cohort,
  predict(lr_cls_fit, mydata_test),
  predict(lr_cls_fit, mydata_test, type = "prob")
)
mydata.lr.pred.values.test
# A tibble: 267 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     0             0.963  0.0368
 2 0     0             0.887  0.113 
 3 0     0             0.734  0.266 
 4 0     0             0.932  0.0677
 5 0     0             0.954  0.0462
 6 0     0             0.893  0.107 
 7 0     0             0.957  0.0434
 8 1     1             0.228  0.772 
 9 0     0             0.988  0.0117
10 0     0             0.784  0.216 
# ℹ 257 more rows
#ROC plots for "testing" dataset
autoplot(roc_curve(mydata.lr.pred.values.test, 
                   truth, 
                   .pred_0))

#Metrics of prediction on "testing" dataset
metrics(mydata.lr.pred.values.test, truth, .pred_class, .pred_0)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.816
2 kap         binary         0.571
3 mn_log_loss binary         0.406
4 roc_auc     binary         0.879
conf_matrix_all <- mydata.lr.pred.values.test %>%
  conf_mat(truth = truth, estimate = .pred_class)
print(conf_matrix_all)
          Truth
Prediction   0   1
         0 160  28
         1  21  58

Logistic Regression - Cross Validation (10-fold)
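The resampling object `mydata_folds` used below was created earlier in the workflow. A typical construction with `rsample::vfold_cv()`, stratified on the outcome so each fold preserves the class balance, looks like the following sketch (shown on the built-in `iris` data, since the project data are not reproduced here):

```r
library(rsample)

# 10-fold CV, stratified so every fold keeps the outcome distribution
set.seed(1234)
folds_demo <- vfold_cv(iris, v = 10, strata = Species)
folds_demo  # a tibble of 10 <split> objects, id = Fold01 ... Fold10
```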

#model building
lr_cls_spec_cv <- 
  logistic_reg() |> 
  set_engine("glm")

#Model fit to FULL dataset
lr_cls_fit_cv <- 
  lr_cls_spec_cv |>
  fit(cohort ~ ., data = imputed_data) #here, I'm using full data.

#Create a workflow() for fitting the glm
glm_wf <- workflow() |>
  add_model(lr_cls_spec_cv) |> 
  add_formula(cohort ~ .)
  
#Cross validation fitting
glm_fit_cv <- 
  glm_wf |>
  fit_resamples(mydata_folds, control = control_resamples(save_pred = TRUE))

#Collect predictions out of folds into one tibble
mydata_glm_cv_preds <- collect_predictions(glm_fit_cv)

#Plot of ROC curve of CV results
autoplot(roc_curve(mydata_glm_cv_preds, 
        cohort, 
        .pred_0))

#To see performance of each fold
mydata_glm_cv_preds |>
  group_by(id) |>
  roc_auc(cohort, .pred_0)
# A tibble: 10 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.911
 2 Fold02 roc_auc binary         0.906
 3 Fold03 roc_auc binary         0.910
 4 Fold04 roc_auc binary         0.929
 5 Fold05 roc_auc binary         0.874
 6 Fold06 roc_auc binary         0.890
 7 Fold07 roc_auc binary         0.899
 8 Fold08 roc_auc binary         0.890
 9 Fold09 roc_auc binary         0.879
10 Fold10 roc_auc binary         0.864
#Showing  roc curves for all folds
mydata_glm_cv_preds |>
  group_by(id) |>
  roc_curve(cohort, .pred_0) |>
  autoplot()

#Overall metrics
collect_metrics(glm_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.836    10 0.00867 Preprocessor1_Model1
2 brier_class binary     0.116    10 0.00436 Preprocessor1_Model1
3 roc_auc     binary     0.895    10 0.00623 Preprocessor1_Model1

Confusion matrix for lr

# Convert probabilities to binary predictions using a threshold
mydata_glm_cv_preds <- mydata_glm_cv_preds |> 
  mutate(
    predicted = ifelse(.pred_1 > 0.5, 1, 0),               # Binary predictions
    predicted = factor(predicted, levels = c(0, 1)),       # Convert to factor
    cohort = factor(cohort, levels = c(0, 1))              # Ensure cohort is a factor
  )

# Calculate the confusion matrix
conf_mat_lr <- conf_mat(mydata_glm_cv_preds, truth = cohort, estimate = predicted)
print(conf_mat_lr)
          Truth
Prediction   0   1
         0 811 128
         1  90 302
# Accuracy
accuracy_lr <- accuracy(mydata_glm_cv_preds, truth = cohort, estimate = predicted)

# Precision
precision_lr<- precision(mydata_glm_cv_preds, truth = cohort, estimate = predicted)

# Recall (Sensitivity)
recall_lr <- recall(mydata_glm_cv_preds, truth = cohort, estimate = predicted)

# F1-Score
f1_lr <- f_meas(mydata_glm_cv_preds, truth = cohort, estimate = predicted, beta = 1)

# Display metrics
metrics_lr <- tibble(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
  Value = c(accuracy_lr$.estimate, precision_lr$.estimate, recall_lr$.estimate, f1_lr$.estimate)
)
print(metrics_lr)
# A tibble: 4 × 2
  Metric    Value
  <chr>     <dbl>
1 Accuracy  0.836
2 Precision 0.864
3 Recall    0.900
4 F1-Score  0.882
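One caution when reading these metrics: yardstick treats the first factor level (here, cohort = 0) as the event by default, so the precision, recall, and F1 above describe detection of the non-OUD class. Recomputing them from the printed confusion matrix makes this explicit (base-R arithmetic on the cell counts shown above):

```r
# Cell counts from the CV confusion matrix (rows = prediction, columns = truth)
tp0 <- 811; fp0 <- 128; fn0 <- 90; tn0 <- 302  # class 0 treated as the event

precision0 <- tp0 / (tp0 + fp0)                # 811 / 939
recall0    <- tp0 / (tp0 + fn0)                # 811 / 901
f1_0       <- 2 * precision0 * recall0 / (precision0 + recall0)
recall1    <- tn0 / (tn0 + fp0)                # recall (sensitivity) for the OUD class

round(c(precision0, recall0, f1_0, recall1), 3)
# 0.864 0.900 0.882 0.702
```

The recall of 0.900 therefore reflects the non-OUD class; sensitivity for identifying OUD itself is about 0.70, which is the clinically relevant figure when weighing false negatives.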

2.5.5. Random forest

Testing/Training

library(randomForest)
library(vip)

Attaching package: 'vip'
The following object is masked from 'package:utils':

    vi
#Model specification/building
rf_spec <- 
  rand_forest(trees = 300, min_n = 5) |> #Min number of data points in a node required for further splitting
  set_engine("randomForest", importance = TRUE) |>
  set_mode("classification")

#Model fit
rf_fit <- rf_spec |>
  fit(cohort ~ ., data = mydata_training)
rf_fit
parsnip model object


Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~300, nodesize = min_rows(~5,      x), importance = ~TRUE) 
               Type of random forest: classification
                     Number of trees: 300
No. of variables tried at each split: 4

        OOB estimate of  error rate: 14.38%
Confusion matrix:
    0   1 class.error
0 673  47  0.06527778
1 106 238  0.30813953
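The out-of-bag figures can be verified directly from the printed confusion matrix (a quick arithmetic check; the OOB error is the share of training cases misclassified by trees that did not see them):

```r
# Row counts from the printed OOB confusion matrix
err0 <- 47 / (673 + 47)                      # class 0 error
err1 <- 106 / (106 + 238)                    # class 1 (OUD) error: ~31% missed
oob  <- (47 + 106) / (673 + 47 + 106 + 238)  # overall OOB error rate
round(c(err0, err1, oob), 4)
# 0.0653 0.3081 0.1438
```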
#Prediction using the "TEST" dataset
rf_predicted <- bind_cols(
  truth = mydata_test$cohort,
  predict(rf_fit, mydata_test),
  predict(rf_fit, mydata_test, type = "prob")
)

rf_predicted
# A tibble: 267 × 4
   truth .pred_class .pred_0 .pred_1
   <fct> <fct>         <dbl>   <dbl>
 1 0     0             0.87  0.13   
 2 0     0             0.957 0.0433 
 3 0     0             0.593 0.407  
 4 0     0             0.95  0.05   
 5 0     0             0.913 0.0867 
 6 0     0             0.907 0.0933 
 7 0     0             0.98  0.02   
 8 1     1             0.28  0.72   
 9 0     0             0.993 0.00667
10 0     0             0.833 0.167  
# ℹ 257 more rows
#ROC AUC and curve for testing data
roc_auc(rf_predicted, truth, .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.865
autoplot(roc_curve(rf_predicted, 
                   truth, 
                   .pred_0))

Random Forest Cross Validation

#I will then use cross-validation to re-estimate the predictive ability more fairly, reusing the same folds created for the glm model above ("mydata_folds")

#workflow setup
rf_workflow <-
  workflow() |>
  add_model(rf_spec) |>
  add_formula(cohort ~ .)

set.seed(1234)

#Use workflow to fit model with each fold of re-sampled data
rf_fit_cv <-
  rf_workflow |>
  fit_resamples(mydata_folds, control = control_resamples(save_pred = TRUE))

#one ROC plot for cross validation datasets
rf_fit_cv |>
  collect_predictions() |>
  roc_curve(cohort, .pred_0)|>
  autoplot()

# Collect predictions for each fold
cv_predictions <- rf_fit_cv |> collect_predictions()

# To see the performance of each fold using ROC AUC
cv_predictions |>
  group_by(id) |>
  roc_auc(cohort, .pred_0)
# A tibble: 10 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.895
 2 Fold02 roc_auc binary         0.917
 3 Fold03 roc_auc binary         0.901
 4 Fold04 roc_auc binary         0.936
 5 Fold05 roc_auc binary         0.896
 6 Fold06 roc_auc binary         0.909
 7 Fold07 roc_auc binary         0.903
 8 Fold08 roc_auc binary         0.889
 9 Fold09 roc_auc binary         0.875
10 Fold10 roc_auc binary         0.880
#Calculating the overall ROC curve data
overall_roc_data <- cv_predictions |>
  roc_curve(cohort, .pred_0)

#Overall metrics
collect_metrics(rf_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.853    10 0.00886 Preprocessor1_Model1
2 brier_class binary     0.114    10 0.00379 Preprocessor1_Model1
3 roc_auc     binary     0.900    10 0.00564 Preprocessor1_Model1

Creating 10-fold CV ROC curves for visualization

# Calculate ROC data for each fold
fold_roc_data <- cv_predictions |>
  group_by(id) |>
  roc_curve(cohort, .pred_0)

# Calculate AUC for each fold
fold_auc <- cv_predictions |>
  group_by(id) |>
  roc_auc(cohort, .pred_0)
fold_auc
# A tibble: 10 × 4
   id     .metric .estimator .estimate
   <chr>  <chr>   <chr>          <dbl>
 1 Fold01 roc_auc binary         0.895
 2 Fold02 roc_auc binary         0.917
 3 Fold03 roc_auc binary         0.901
 4 Fold04 roc_auc binary         0.936
 5 Fold05 roc_auc binary         0.896
 6 Fold06 roc_auc binary         0.909
 7 Fold07 roc_auc binary         0.903
 8 Fold08 roc_auc binary         0.889
 9 Fold09 roc_auc binary         0.875
10 Fold10 roc_auc binary         0.880
# Calculate the overall ROC curve
overall_roc_data <- cv_predictions |>
  roc_curve(cohort, .pred_0)


# Create labels with AUC for each fold
fold_auc_labels <- fold_auc |>
  mutate(label = paste0(id, " (AUC = ", round(.estimate, 3), ")"))

# Update the fold IDs in cv_predictions with the new labels
cv_predictions <- cv_predictions |>
  left_join(fold_auc_labels, by = "id") |>
  mutate(id = label)

# Plot ROC curves for each fold with AUC in legend and overlay overall ROC curve
cv_predictions |>
  group_by(id) |>
  roc_curve(cohort, .pred_0) |>
  autoplot() +
  geom_line(data = overall_roc_data, aes(x = 1 - specificity, y = sensitivity), 
            color = "black", linewidth = 0.5) +
  labs(
    title = "ROC Curves for Each Fold with Overall ROC Curve",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Fold (AUC)"
  ) +
  theme_minimal()

#To understand which variables contribute most to the classification, I extract the importance scores using the vip package.
rf_fit |>
  extract_fit_engine() |>
  importance()
                           0          1 MeanDecreaseAccuracy MeanDecreaseGini
DEM002             5.7326700  3.8457654            6.4022148         6.357570
DEM003            18.4183753 21.5419733           27.5409171        57.391747
DEM004            10.4442460 11.4292251           14.5324973        30.405003
DEM005             5.4498311  2.5549600            5.8851177         6.881665
DEM006             2.1803304 13.4656090           10.9847306        25.398586
DEM007             0.6645845 -0.3803391            0.4482167         2.392813
DEM008             5.0653961  7.5854915            8.8289486        11.854272
pain_severity_sum  8.6748412  9.0857619           12.1363970        23.977653
pain_interf_sum    7.4045860  4.0534481            9.1406625        20.939728
dependence_level  21.6320710 29.2482583           33.3413863        69.654263
Anxiety            4.0942284  1.0814015            4.1728970         3.070401
Depression         6.2308213 -1.9540668            4.6804014         3.056101
ca_sum            10.5933176  2.5141526           11.0363022        19.461584
pps_sum           16.0103016 10.8611991           19.1092319        39.940282
suicidality_risk  13.2779833  8.0029821           14.6756127        26.260574
alcohol            4.4854879  3.7875901            5.7912365         6.064034
SIDSSI             2.5458094  1.0475184            2.6231522        14.118774
SSDSSI             9.4056558 10.6321073           13.7347473        22.777796
ISDSSI             4.3026969  2.7976443            5.0745571        14.628410
rf_fit |>
  extract_fit_engine() |>
  vip()

Confusion matrix/ calculation

# Ensure that both 'cohort' (truth) and 'predicted' (estimate) are factors
cv_predictions <- cv_predictions |> 
  mutate(predicted = ifelse(.pred_1 > 0.5, 1, 0))

cv_predictions <- cv_predictions |>
  mutate(
    predicted = factor(predicted, levels = c(0, 1)),  # Convert predicted to a factor
    cohort = factor(cohort, levels = c(0, 1))         # Convert cohort to a factor
  )
# Confusion matrix
conf_mat <- cv_predictions |> 
  conf_mat(truth = cohort, estimate = predicted)
print(conf_mat)
          Truth
Prediction   0   1
         0 836 130
         1  65 300
# Accuracy
accuracy <- accuracy(cv_predictions, truth = cohort, estimate = predicted)

# Precision
precision <- precision(cv_predictions, truth = cohort, estimate = predicted)

# Recall (Sensitivity)
recall <- recall(cv_predictions, truth = cohort, estimate = predicted)

# F1-Score
f1 <- f_meas(cv_predictions, truth = cohort, estimate = predicted, beta = 1)

# Display metrics
metrics <- tibble(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
  Value = c(accuracy$.estimate, precision$.estimate, recall$.estimate, f1$.estimate)
)
print(metrics)
# A tibble: 4 × 2
  Metric    Value
  <chr>     <dbl>
1 Accuracy  0.853
2 Precision 0.865
3 Recall    0.928
4 F1-Score  0.896

2.5.6. Gradient boosting

Testing/Training

#loading the packages
library(xgboost)

Attaching package: 'xgboost'
The following object is masked from 'package:dplyr':

    slice
#Model specification/building
bt_spec <- 
  boost_tree(trees = 50,        #Number of trees
             tree_depth = 4) |> #Max depth of tree
  set_mode("classification") |>
  set_engine("xgboost")
bt_spec
Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 50
  tree_depth = 4

Computational engine: xgboost 
#Recipe
bt_recipe <-
  recipe(cohort ~ ., data = mydata_training) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_predictors())

#Workflow specification
bt_workflow <- workflow() |>
  add_model(bt_spec) |>
  add_recipe(bt_recipe)

#Model fit with training data
bt_fit <- fit(bt_workflow, data = mydata_training)
bt_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_dummy()
• step_normalize()

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 78.6 Kb 
call:
  xgboost::xgb.train(params = list(eta = 0.3, max_depth = 4, gamma = 0, 
    colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 1, 
    subsample = 1), data = x$data, nrounds = 50, watchlist = x$watchlist, 
    verbose = 0, nthread = 1, objective = "binary:logistic")
params (as set within xgb.train):
  eta = "0.3", max_depth = "4", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "1", subsample = "1", nthread = "1", objective = "binary:logistic", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 35 
niter: 50
nfeatures : 35 
evaluation_log:
  iter training_logloss
 <num>            <num>
     1       0.54771753
     2       0.46444679
   ---              ---
    49       0.08439110
    50       0.08169347
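The evaluation log shows the training log-loss falling from 0.548 to 0.082 over 50 rounds. As a reminder of what the `binary:logistic` objective measures, binary log-loss for a single observation can be computed by hand (illustrative sketch):

```r
# Binary log-loss: -(y*log(p) + (1-y)*log(1-p)), the quantity xgboost minimizes
logloss <- function(y, p) -(y * log(p) + (1 - y) * log(1 - p))

round(logloss(1, 0.9), 3)  # confident, correct prediction -> small loss
# 0.105
round(logloss(1, 0.1), 3)  # confident, wrong prediction -> heavily penalized
# 2.303
```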
#Prediction model using the "TEST" dataset
my.bt.pred <- bind_cols(
  truth = mydata_test$cohort,
  predict(bt_fit, mydata_test),
  predict(bt_fit, mydata_test, type = "prob"))

#ROC plots for testing data
roc_auc(my.bt.pred, truth, 
        .pred_0)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.844
autoplot(roc_curve(my.bt.pred, 
                   truth, 
                   .pred_0))

#Variable importance extracted with vip
vip(bt_fit)

# Variable importance extracted from the underlying xgboost object
bt_object <- extract_fit_parsnip(bt_fit)$fit
xgb.importance(model = bt_object)
                                                   Feature         Gain
                                                    <char>        <num>
 1:                            dependence_level_Non.smoker 0.2442186014
 2:                                                 DEM003 0.1442582805
 3:                                                pps_sum 0.0885686836
 4:                                      pain_severity_sum 0.0737508088
 5:                                        pain_interf_sum 0.0569315048
 6:                                                 SSDSSI 0.0507570559
 7:                                                 ISDSSI 0.0453771084
 8:                                                 ca_sum 0.0406357276
 9:                              suicidality_risk_Low.Risk 0.0383030074
                                                   Feature         Gain       Cover   Frequency
10:                                                alcohol 0.0340973841 0.029787293 0.039024390
11:                                                 SIDSSI 0.0332746477 0.052176250 0.058536585
12:                                            DEM002_Male 0.0258049563 0.033250578 0.022764228
13:                               DEM004_Married.Partnered 0.0198732658 0.011148203 0.009756098
14:                                   DEM004_Never.married 0.0179435277 0.008852702 0.008130081
15:                                   DEM006_Grade.7.to.12 0.0139061354 0.013025493 0.013008130
16:                              dependence_level_Moderate 0.0094220093 0.009590605 0.013008130
17: DEM006_Graduated.high.school.or.high.school.equivalent 0.0079656672 0.009030857 0.008130081
18:                       DEM006_Graduated.2.years.college 0.0076790359 0.018150007 0.013008130
19:                      DEM005_More.than.one.other.person 0.0075681026 0.005121314 0.016260163
20:                       DEM006_Graduated.4.years.college 0.0065777167 0.009636365 0.014634146
21:                                 DEM008_Are.comfortable 0.0065086678 0.009362992 0.011382114
22:                                DEM005_One.other.person 0.0052698547 0.007300096 0.014634146
23:                                             Anxiety_X1 0.0043604908 0.002121541 0.008130081
24:                                              DEM007_No 0.0034168984 0.011663098 0.011382114
25:                       dependence_level_Low.to.Moderate 0.0033138418 0.008555704 0.008130081
26:               DEM006_Part.graduate.professional.school 0.0029585441 0.015959164 0.006504065
27:                   DEM008_Have.just.enough.to.get.along 0.0028777182 0.010409277 0.009756098
28:                                    DEM006_Part.college 0.0026279278 0.002724621 0.008130081
29:                                   dependence_level_Low 0.0014616925 0.002682095 0.001626016
30:                                          Depression_X1 0.0002911368 0.001740386 0.001626016

Cover and Frequency for features 1-9 (listed above):

 1: 0.090418465 0.026016260
 2: 0.145948111 0.097560976
 3: 0.114974683 0.131707317
 4: 0.072952827 0.094308943
 5: 0.099729056 0.113821138
 6: 0.056329944 0.076422764
 7: 0.058036081 0.079674797
 8: 0.061761145 0.069918699
 9: 0.027561045 0.013008130
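
For interpretation: Gain is each feature's share of the total loss reduction contributed by its splits, normalized so the column sums to 1 across all features. A minimal base-R sketch with made-up numbers (the feature names follow the top predictors identified later, but the values are illustrative only):

```r
# Gain-style importance: each feature's share of total split gain,
# normalized to sum to 1 (toy numbers, not the fitted model's values)
raw_gain <- c(nicotine = 40, age = 25, mental_defeat = 20, alcohol = 10, other = 5)
gain <- raw_gain / sum(raw_gain)

round(gain, 3)
sum(gain)   # always 1 after normalization
```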

Cross-Validation - Testing/Training

#workflow setup
bt_workflow <-
  workflow() |>
  add_model(bt_spec) |>
  add_formula(cohort ~ .)

set.seed(1234)
#Use workflow to fit model with each fold of re-sampled data
bt_fit_cv <-
  bt_workflow |>
  fit_resamples(mydata_folds, control = control_resamples(save_pred = TRUE))
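
Conceptually, `fit_resamples()` refits the workflow once per fold, assessing on the held-out tenth each time. A minimal base-R sketch of the fold bookkeeping (toy sample size and indices, not the study data):

```r
# Manual 10-fold assignment: each row lands in exactly one assessment fold
set.seed(1234)
n <- 50                                       # toy sample size
fold_id <- sample(rep(1:10, length.out = n))  # balanced fold labels, shuffled

# For fold k, rows with fold_id == k form the assessment set;
# the remaining rows form the analysis (training) set
assess_1 <- which(fold_id == 1)
train_1  <- which(fold_id != 1)
```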

#ROC plots for cross validation datasets
bt_fit_cv |>
  collect_predictions() |>
  roc_curve(cohort, .pred_0) |>
  autoplot()

#Overall metrics
collect_metrics(bt_fit_cv)
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.833    10 0.0101  Preprocessor1_Model1
2 brier_class binary     0.122    10 0.00560 Preprocessor1_Model1
3 roc_auc     binary     0.887    10 0.00780 Preprocessor1_Model1

Confusion matrix and performance metrics

# Collect predictions from the cross-validation results
bt_predictions <- bt_fit_cv |>
  collect_predictions()

# Convert probabilities to binary predictions
bt_predictions <- bt_predictions |> 
  mutate(
    predicted = ifelse(.pred_1 > 0.5, 1, 0),               # Binary predictions
    predicted = factor(predicted, levels = c(0, 1)),       # Convert to factor
    cohort = factor(cohort, levels = c(0, 1))              # Ensure cohort is a factor
  )
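
The 0.5 cutoff above is a modeling choice. Sweeping the threshold on toy probabilities (illustrative values, not study data) shows how precision and recall trade off:

```r
# Toy predicted probabilities and true labels (illustrative only)
p <- c(0.95, 0.80, 0.70, 0.60, 0.45, 0.30, 0.20, 0.05)
y <- c(1,    1,    0,    1,    1,    0,    0,    0)

metrics_at <- function(thr) {
  pred <- as.integer(p > thr)
  tp <- sum(pred == 1 & y == 1)
  fp <- sum(pred == 1 & y == 0)
  fn <- sum(pred == 0 & y == 1)
  c(threshold = thr, precision = tp / (tp + fp), recall = tp / (tp + fn))
}

# Lower thresholds raise recall at the cost of precision, and vice versa
round(t(sapply(c(0.3, 0.5, 0.7), metrics_at)), 3)
```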

# Overall metrics: yardstick's metrics() returns accuracy/kap (plus roc_auc and
# log-loss when a probability column is supplied), so this filter keeps accuracy
# only; precision, recall, and F1 are computed individually below
bt_metrics <- bt_predictions |>
  metrics(truth = cohort, estimate = predicted, .pred_1) |>
  filter(.metric %in% c("accuracy", "precision", "recall", "f_meas"))

print(bt_metrics)
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.833
# Accuracy
accuracy_bt <- accuracy(bt_predictions, truth = cohort, estimate = predicted)

# Precision
precision_bt<- precision(bt_predictions, truth = cohort, estimate = predicted)

# Recall (Sensitivity)
recall_bt<- recall(bt_predictions, truth = cohort, estimate = predicted)

# F1-Score
f1_bt <- f_meas(bt_predictions, truth = cohort, estimate = predicted, beta = 1)

# Display metrics
metrics_bt <- tibble(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score"),
  Value = c(accuracy_bt$.estimate, precision_bt$.estimate, recall_bt$.estimate, f1_bt$.estimate)
)
print(metrics_bt)
# A tibble: 4 × 2
  Metric    Value
  <chr>     <dbl>
1 Accuracy  0.833
2 Precision 0.865
3 Recall    0.893
4 F1-Score  0.879
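
As a sanity check, all four reported metrics are simple functions of the confusion-matrix counts. A base-R sketch with hypothetical counts (not the study's data):

```r
# Hypothetical confusion-matrix counts (illustrative only)
tp <- 90; fp <- 15; fn <- 10; tn <- 85

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)                       # sensitivity
f1        <- 2 * precision * recall / (precision + recall)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, f1 = f1), 3)
```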

2.5.7. Model Performance

Three predictive models were implemented to predict opioid use disorder, and 10-fold cross-validation was conducted to evaluate model performance. All predictive models achieved satisfactory performance. When comparing cross-validated roc_auc values across the three algorithms, random forest had the highest value.
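
The roc_auc metric equals the probability that a randomly chosen case is scored above a randomly chosen control, so it can be computed directly from ranks. A base-R sketch on toy scores (not the study data):

```r
# AUC via the rank (Mann-Whitney) identity; mid-ranks handle ties
auc_rank <- function(score, label) {   # label: 1 = event, 0 = non-event
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
label <- c(1,   1,   0,   1,   0,   1,   0,   0)
auc_rank(score, label)   # 0.8125
```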

#Naming
lr_roc_data <- roc_curve(mydata.lr.pred.values.test, truth, .pred_0)
lr_roc_cv_data <- roc_curve(mydata_glm_cv_preds, cohort, .pred_0)
rf_roc_data <- roc_curve(rf_predicted, truth, .pred_0)
rf_roc_cv_data <- rf_fit_cv |>
  collect_predictions() |>
  roc_curve(cohort, .pred_0)
bt_roc_data <- roc_curve(my.bt.pred, truth, .pred_0)
bt_roc_cv_data <- bt_fit_cv |>   # distinct name so the test-set curve is not overwritten
  collect_predictions() |>
  roc_curve(cohort, .pred_0)

#ggplot: map each curve to its model label so scale_color_manual()
#assigns colors and legend entries unambiguously
ggplot() +
  geom_path(data = lr_roc_data, aes(x = 1 - specificity, 
                                    y = sensitivity, 
                                    color = "LR")) +
  geom_path(data = lr_roc_cv_data, aes(x = 1 - specificity, 
                                       y = sensitivity,
                                       color = "LR - 10-fold CV")) +
  geom_path(data = rf_roc_data, aes(x = 1 - specificity, 
                                    y = sensitivity,
                                    color = "RF")) +
  geom_path(data = rf_roc_cv_data, aes(x = 1 - specificity, 
                                       y = sensitivity, 
                                       color = "RF - 10-fold CV")) +
  geom_path(data = bt_roc_data, aes(x = 1 - specificity, 
                                    y = sensitivity, 
                                    color = "GT")) +
  geom_path(data = bt_roc_cv_data, aes(x = 1 - specificity, 
                                       y = sensitivity, 
                                       color = "GT - 10-fold CV")) +
  labs(
    title = "Combined ROC Curves (Test Set and 10-Fold CV)",
    x = "False Positive Rate",
    y = "True Positive Rate",
    color = "Model"
  ) +
  scale_color_manual(values = c("LR" = "blue",
                                "LR - 10-fold CV" = "lightblue",
                                "RF" = "pink",
                                "RF - 10-fold CV" = "red",
                                "GT" = "purple",
                                "GT - 10-fold CV" = "green")) +
  geom_abline(lty = 3) +
  coord_equal() +
  theme_bw()

Discussion

The RF model demonstrated the highest overall performance, with an F1-score of 0.8936, a recall of 0.9276, and the highest AUC of 0.8969, indicating superior discriminatory ability. The XGBoost model achieved the highest precision (0.8669) but had a slightly lower AUC (0.8780) compared to the other models. The 10-fold ROC curves (light blue for LR, red for RF, and green for GT) highlight the models’ performance across different subsets of the data. While RF shows the best overall sensitivity across a range of false positive rates, GT underperformed relative to RF and LR, as evidenced by its lower AUC despite visually appearing to follow RF’s trajectory. LR, although slightly less sensitive than RF in some regions, exhibited more consistent performance than GT. These findings indicate that RF offers the best overall predictive performance, while GT may struggle with reliability compared to LR and RF in this specific task.

In the 10-fold cross-validation of the random forest algorithm, fold-level AUC values range from 0.869 to 0.927, indicating consistently strong performance across all folds with minor variability. The fold curves are also closely clustered around the overall curve, which suggests the model's stability.

This model has clinical and research implications. In practice, it enables early identification of high-risk patients, which can support clinical decision making. Feature importance analysis also identified the top three factors contributing to the outcome, so interventions targeting these factors should be developed. Interestingly, these results align with certain findings in genetics, particularly the shared genetic risk factors between nicotine dependence and OUD. This study additionally highlights the role of mental defeat, a factor not yet incorporated into existing OUD models, which should be explored further in future research. Finally, this work lays the foundation for future research by providing a strong predictive framework to support the development of more advanced machine learning models.

Limitations

The results of these analyses need to be interpreted with caution. First, because of the genotypic objective of the parent study, all participants were Caucasian, limiting the generalizability of the findings to more diverse populations. Second, a substantial amount of missing data was observed in two key factors within this model. Despite this, these variables are clinically considered highly relevant to the outcome, and their imputation does not appear to have adversely affected the model's performance, as discussed previously. Lastly, this model was developed on a relatively small sample. However, it was specifically tailored to chronic pain patients on long-term opioid therapy, a population at higher risk of developing OUD than the general population. Considering that the model incorporates several significant risk factors often excluded from existing machine learning models, it offers meaningful insights and potential contributions to the field.

Conclusion

Random forest achieved the best predictive performance, with an AUC-ROC of 0.89. Feature importance analysis identified nicotine dependence, age, and mental defeat as the top three features. The model was evaluated using a pre-determined set of variables and showed satisfactory prediction performance; it can therefore be a useful tool for clinicians to identify patients at risk of developing OUD and to implement early intervention for prevention. This model not only offers a strong predictive framework, but also lays the groundwork for more advanced machine learning models that make use of a broader range of variables, ultimately improving early identification and intervention strategies for at-risk patients.